Fault Tolerant MapReduce-MPI for HPC Clusters

Yanfei Guo*, Wesley Bland+, Pavan Balaji+, Xiaobo Zhou*

* Dept. of Computer Science, University of Colorado, Colorado Springs
+ Mathematics and Computer Science Division, Argonne National Laboratory
 
Outline

Overview
Background
Challenges
FT-MRMPI Design
Checkpoint-Restart
Detect-Resume
Evaluation
Conclusion
 
MapReduce on HPC Clusters

What MapReduce provides:
Write serial code, run it in parallel
Reliable execution with the detect-restart fault tolerance model

HPC clusters (Mira as an example) vs. a typical Hadoop cluster:

              Mira                            Hadoop Cluster
CPU           16x 1.6 GHz PPC A2 cores        16x 2.4 GHz Intel Xeon cores
Memory        16 GB (1 GB/core)               64-128 GB (4-8 GB/core)
Storage       Local: N/A; Shared: 24 PB SAN   Local: 500 GB x 8; Shared: N/A
Network       5D Torus                        10/40 Gbps Ethernet
Software      MPI                             Java
File System   GPFS                            HDFS
Scheduler     Cobalt                          Hadoop
MapReduce Lib MapReduce-MPI                   Hadoop

MapReduce on HPC clusters brings:
High-performance big data analytics
Reduced data movement between systems

Goal: MapReduce on the HPC software stack, with fault tolerance.

Fault Tolerance Model of MapReduce

Master/Worker model
Detect: the master monitors all workers
Restart: affected tasks are rescheduled to another worker
 
No Fault Tolerance in MPI

MPI: Message Passing Interface
Inter-process communication
Communicator (COMM)

Frequent failures at large scale:
MTTF = 4.2 hr (NCSA Blue Waters)
MTTF < 1 hr expected in the future

MPI Standard 3.1:
Custom error handlers, but no guarantee that all processes enter the error handler
No way to fix a broken COMM
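
As a concrete illustration, here is a minimal sketch (not from the paper) of installing a custom communicator error handler with the standard MPI-3.1 API. It can log the failure on the ranks that observe it, but, as noted above, nothing forces the other processes into the handler and the communicator stays broken.

#include <mpi.h>
#include <cstdio>

// Custom handler: MPI runs it only on processes that observe the error.
static void comm_error_handler(MPI_Comm *comm, int *errcode, ...) {
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(*errcode, msg, &len);
    std::fprintf(stderr, "MPI error observed: %s\n", msg);
    // MPI-3.1 gives no portable way to repair *comm here or to
    // guarantee the remaining ranks enter this handler.
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Errhandler eh;
    MPI_Comm_create_errhandler(comm_error_handler, &eh);
    // Replace the default MPI_ERRORS_ARE_FATAL behavior.
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
    /* ... application communication ... */
    MPI_Errhandler_free(&eh);
    MPI_Finalize();
    return 0;
}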
 
Scheduling Restrictions

Gang scheduling:
Schedules all processes at the same time
Preferred by HPC applications with extensive synchronization

MapReduce scheduling:
Per-task scheduling; each task is scheduled as early as possible
Compatible with the detect-restart fault tolerance model

Resizing a running job:
Many platforms do not support it
Large overhead (re-queueing)

The detect-restart fault tolerance model is not compatible with HPC schedulers.
 
Overall Design

Fault-tolerant MapReduce using MPI:
Reliable failure detection and propagation
Compatible fault tolerance model

FT-MRMPI components (per MapReduce process, on top of MPI):
Task Runner
Distributed Master & Load Balancer
Failure Handler

Features:
Traceable job interfaces
HPC-scheduler-compatible fault tolerance models: Checkpoint-Restart and Detect-Resume
 
 
Task Runner

Tracing establishes consistent states
Operations are delegated to the library
New interface (shown below):
Highly extensible
Embedded tracing
Record-level consistency
 
 
template <typename K, typename V> class RecordReader;
template <typename K, typename V> class RecordWriter;
template <typename K, typename V> class Mapper;
template <typename K, typename V> class Reducer;

class WordRecordReader : public RecordReader<int, string> { /* ... */ };

void Mapper::map(int& key, string& value,
                 BaseRecordWriter* out, void* param) {
    out->add(value, 1);
}

int main(int narg, char** args) {
    MPI_Init(&narg, &args);
    <***snip***>
    mr->map(new WCMapper(), new WCReader(), NULL, NULL);
    mr->collate(NULL);
    mr->reduce(new WCReducer(), NULL,
               new WCWriter(), NULL);
    <***snip***>
}
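
For illustration, a hypothetical record reader built on the RecordReader interface above. The slide does not show FT-MRMPI's actual virtual methods; next() and tell() below are assumed names, used only to show how routing every record through the library lets it trace the offset of the last consumed record and establish record-level consistent states.

#include <fstream>
#include <string>

// Hypothetical subclass; method names are assumptions, not FT-MRMPI's API.
class LineRecordReader : public RecordReader<int, std::string> {
public:
    explicit LineRecordReader(const std::string &path) : in_(path) {}

    // The library calls this once per record, so it can log the offset
    // of the last fully consumed record before invoking map().
    bool next(int &key, std::string &value) {
        offset_ = in_.tellg();               // position of this record
        if (!std::getline(in_, value)) return false;
        key = static_cast<int>(offset_);     // e.g., byte offset as key
        return true;
    }

    long tell() const { return static_cast<long>(offset_); }

private:
    std::ifstream in_;
    std::streampos offset_{0};
};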
 
Distributed Master & Load Balancer

Task dispatching:
Global task pool
Job init
Recovery

Global consistent state:
Shuffle buffer tracing

Load balancing (see the sketch below):
Monitors the processing speed of tasks
Linear job performance model
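
A minimal sketch of the linear model, under the simplest assumption (the paper's exact formula is not on the slide) that per-process throughput is roughly constant, so remaining time is remaining records divided by the observed rate:

// Per-process progress as observed by the distributed master.
struct ProcStats {
    double records_done;       // records processed so far
    double seconds_elapsed;    // wall time spent processing
    double records_remaining;  // records still assigned to this process
};

// Linear model: time_left = remaining / rate, where rate = done / elapsed.
double estimated_seconds_left(const ProcStats &p) {
    double rate = p.records_done / p.seconds_elapsed;
    return p.records_remaining / rate;
}
// The load balancer can dispatch the next task from the global pool to
// the process with the smallest estimate.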
 
 
Fault Tolerance Model: Checkpoint-Restart

Custom error handler:
Save state and exit gracefully
Propagate the failure event with MPI_Abort()

Checkpoint:
Asynchronous within each phase
Saved locally
Multiple granularities (see the sketch below)

Restart to recover:
Resubmit with -recover
Picks up from where the job left off
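
A sketch of record-granularity checkpointing, with an assumed file format (FT-MRMPI's actual on-disk layout is not shown on the slide): every N records the current position is appended to a local checkpoint file, and on resubmission with -recover the last entry tells the task where to resume.

#include <cstdio>

// Append a checkpoint entry every `granularity` records.
void maybe_checkpoint(std::FILE *ckpt, long record_index,
                      long file_offset, long granularity) {
    if (record_index % granularity == 0) {
        std::fprintf(ckpt, "%ld %ld\n", record_index, file_offset);
        std::fflush(ckpt);  // keep the local checkpoint file current
    }
}

// On recovery, scan to the last entry and skip to it instead of
// reprocessing the whole input.
bool last_checkpoint(std::FILE *ckpt, long &record_index, long &offset) {
    bool found = false;
    while (std::fscanf(ckpt, "%ld %ld", &record_index, &offset) == 2)
        found = true;
    return found;
}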
 
 
 
Where to Write Checkpoints

Write to GPFS:
Performance issues due to small I/O
Interference on shared hardware

Write to node-local disk:
Fast, no interference
But is it globally available during recovery?

Background data copier (see the sketch below):
Write locally
Sync to GPFS in the background
Overlaps I/O with computation
 
(Chart: job completion time writing checkpoints to GPFS, local disk, and the background copier; wordcount 100 GB, 256 procs, ppn=8)
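
The sketch below illustrates the background-copier idea with a plain worker thread: checkpoints are written to the fast local disk, queued, and mirrored to a GPFS directory while computation continues. The queue, paths, and lifetime handling are illustrative assumptions, not FT-MRMPI's implementation.

#include <condition_variable>
#include <filesystem>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class BackgroundCopier {
public:
    explicit BackgroundCopier(std::string gpfs_dir)
        : gpfs_dir_(std::move(gpfs_dir)),
          worker_(&BackgroundCopier::drain, this) {}

    // Called after each local checkpoint write; returns immediately.
    void enqueue(const std::string &local_path) {
        { std::lock_guard<std::mutex> g(m_); q_.push(local_path); }
        cv_.notify_one();
    }

private:
    void drain() {  // runs forever; shutdown omitted for brevity
        namespace fs = std::filesystem;
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            std::string src = q_.front(); q_.pop();
            lk.unlock();
            // Mirror the local checkpoint into GPFS in the background.
            fs::copy(src, fs::path(gpfs_dir_) / fs::path(src).filename(),
                     fs::copy_options::overwrite_existing);
        }
    }

    std::string gpfs_dir_;
    std::queue<std::string> q_;
    std::mutex m_;
    std::condition_variable cv_;
    std::thread worker_;
};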
 
 
Recovery Point

Recover to the last file (ft-file):
Less frequent checkpoints
Must reprocess the file on recovery, losing some work

Recover to the last record (ft-rec):
Requires fine-grained checkpoints
Skips records rather than reprocessing them
 
(Charts: job recovery time of ft-rec vs. ft-file, split into init, recover, runtime, and skip/reprocess; wordcount and pagerank)
 
 
Drawbacks of Checkpoint-Restart

Checkpoint-Restart works, but it is not perfect:
Large overhead from reading and writing checkpoints
Requires human intervention
Failures can occur during recovery
 
Fault Tolerance Model: Detect-Resume

Detect:
Global knowledge of failures
Identify failed processes by comparing process groups

Resume:
Fix the COMM by excluding failed processes
Balanced redistribution of affected tasks
Work-conserving vs. non-work-conserving recovery

User Level Failure Mitigation (ULFM), see the sketch below:
MPIX_Comm_revoke()
MPIX_Comm_shrink()
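
A sketch of the Detect-Resume sequence using the ULFM calls named above (error checking and the task redistribution itself are omitted; this is not the paper's exact code):

#include <mpi.h>
#include <mpi-ext.h>  // ULFM prototypes: MPIX_Comm_revoke, MPIX_Comm_shrink

// Called once a process failure has been reported on `comm`.
void resume_after_failure(MPI_Comm &comm) {
    // Detect: make the failure known to every surviving rank.
    MPIX_Comm_revoke(comm);

    // Resume: build a working communicator without the failed processes.
    MPI_Comm newcomm;
    MPIX_Comm_shrink(comm, &newcomm);

    // Identify the failed ranks by comparing the old and new groups.
    MPI_Group old_grp, new_grp, failed_grp;
    MPI_Comm_group(comm, &old_grp);
    MPI_Comm_group(newcomm, &new_grp);
    MPI_Group_difference(old_grp, new_grp, &failed_grp);
    // ...redistribute the failed ranks' tasks across newcomm here...

    MPI_Group_free(&old_grp);
    MPI_Group_free(&new_grp);
    MPI_Group_free(&failed_grp);
    MPI_Comm_free(&comm);
    comm = newcomm;
}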
 
Evaluation Setup

LCRC Fusion cluster [1]:
256 nodes
CPU: 2-way 8-core Intel Xeon X5550
Memory: 36 GB
Local disk: 250 GB
Network: Mellanox InfiniBand QDR

Benchmarks: wordcount, BFS, pagerank, mrmpiBLAST

[1] http://www.lcrc.anl.gov
 
 
Job Performance

10%-13% overhead from checkpointing
Up to 39% shorter completion time under failures
 
 
Checkpoint Overhead

Factors:
Granularity: number of records per checkpoint
Size of the records
 
 
Time Decomposition

Performance with failure and recovery (wordcount, all processes together)
Detect-Resume has less data to recover
 
 
Continuous Failures

Pagerank, 256 processes, randomly killing 1 process every 5 seconds
 
 
Conclusion

First fault-tolerant MapReduce implementation in MPI
Redesigned MR-MPI to provide fault tolerance
Highly extensible while providing the essential features for fault tolerance
Two fault tolerance models: Checkpoint-Restart and Detect-Resume
 
 
Thank you!
 
Q & A
 
Backup Slides
 
 
Prefetching Data Copier

Recovering from GPFS:
Reading everything from GPFS makes processes wait for I/O

Prefetching in recovery:
Move data from GPFS to local disk
Overlap I/O with computation

2-Pass KV-KMV Conversion

4-pass conversion in MR-MPI:
Excessive disk I/O during shuffle
Hard to make checkpoints

2-pass conversion (see the sketch below):
Log-structured file system
KV -> sketch, sketch -> KMV
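
An in-memory illustration of the two-pass idea, KV -> sketch -> KMV (MR-MPI's actual conversion works on log-structured files on disk; the containers here are stand-ins): pass one scans the KV data and builds a sketch assigning each distinct key a slot, pass two appends every value to its key's slot, producing the grouped KMV layout in two scans.

#include <map>
#include <string>
#include <utility>
#include <vector>

using KV  = std::vector<std::pair<std::string, std::string>>;
using KMV = std::vector<std::pair<std::string, std::vector<std::string>>>;

KMV two_pass_convert(const KV &log) {
    // Pass 1: the "sketch" -- assign each distinct key a slot index.
    std::map<std::string, size_t> slot;
    for (const auto &kv : log)
        slot.emplace(kv.first, slot.size());

    // Pass 2: append each value to its key's slot (grouped KMV layout).
    KMV out(slot.size());
    for (const auto &s : slot)
        out[s.second].first = s.first;
    for (const auto &kv : log)
        out[slot[kv.first]].second.push_back(kv.second);
    return out;
}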
 
Recover Time

Recovery from local disk, from GPFS, and from GPFS with prefetching
 