Fault-Tolerant MapReduce-MPI for HPC Clusters: Enhancing Fault Tolerance in High-Performance Computing

This research discusses the design and implementation of FT-MRMPI for HPC clusters, focusing on fault tolerance and reliability in MapReduce applications. It addresses challenges, presents the fault tolerance model, and highlights the differences in fault tolerance between MapReduce and MPI. The study also delves into scheduling restrictions and the compatibility of detect-restart fault tolerance models with HPC schedulers.



Presentation Transcript


  1. Fault Tolerant MapReduce-MPI for HPC Clusters
     Yanfei Guo*, Wesley Bland+, Pavan Balaji+, Xiaobo Zhou*
     *Dept. of Computer Science, University of Colorado, Colorado Springs
     +Mathematics and Computer Science Division, Argonne National Laboratory

  2. Outline
     - Overview
     - Background
     - Challenges
     - FT-MRMPI Design
     - Checkpoint-Restart
     - Detect-Resume
     - Evaluation
     - Conclusion

  3. MapReduce on HPC Clusters
     - What MapReduce provides: write serial code and run it in parallel; reliable execution with the detect-restart fault tolerance model
     - HPC clusters: high-performance CPU, storage, and network
     - MapReduce on HPC clusters: high-performance big data analytics; reduced data movement between systems; MapReduce on the HPC software stack, with fault tolerance
     - Comparison of Mira (HPC) vs. a typical Hadoop cluster:
         CPU:           16 1.6GHz PPC A2 cores       | 16 2.4GHz Intel Xeon cores
         Memory:        16 GB (1 GB/core)            | 64-128 GB (4-8 GB/core)
         Storage:       local: N/A, shared: 24 PB SAN | local: 500 GB x 8, shared: N/A
         Network:       5D Torus                     | 10/40 Gbps Ethernet
         Software Env:  MPI                          | Java
         File System:   GPFS                         | HDFS
         Scheduler:     Cobalt                       | Hadoop
         MapReduce Lib: MapReduce-MPI                | Hadoop

  4. Fault Tolerance Model of MapReduce
     - Master/worker model
     - Detect: the master monitors all workers
     - Restart: affected tasks are rescheduled to another worker
     - (Diagram: the scheduler submits a job to the master, which dispatches MapTasks and ReduceTasks into the map and reduce slots of each worker)

  5. No Fault Tolerance in MPI
     - MPI: Message Passing Interface
       - Inter-process communication
       - Communicator (COMM)
     - Frequent failures at large scale
       - MTTF = 4.2 hr (NCSA Blue Waters)
       - MTTF < 1 hr expected in the future
     - MPI Standard 3.1: custom error handler (a sketch of installing one follows this slide)
       - No guarantee that all processes enter the error handler
       - No way to fix a broken COMM
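
As a concrete illustration of the custom error handler mentioned above, here is a minimal sketch (not from the slides) of installing a communicator error handler in place of the default MPI_ERRORS_ARE_FATAL. The handler body and what it does on error are assumptions for illustration only.

    #include <mpi.h>
    #include <cstdio>

    // Called by MPI when an operation on the communicator fails. Only the
    // processes that actually hit the error are guaranteed to get here, and
    // a broken communicator cannot be repaired from inside the handler.
    static void comm_error_handler(MPI_Comm *comm, int *errcode, ...) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(*errcode, msg, &len);
        std::fprintf(stderr, "MPI error caught: %s\n", msg);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        // Replace the default MPI_ERRORS_ARE_FATAL handler on COMM_WORLD.
        MPI_Errhandler errh;
        MPI_Comm_create_errhandler(comm_error_handler, &errh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);

        // ... MapReduce work over MPI_COMM_WORLD ...

        MPI_Errhandler_free(&errh);
        MPI_Finalize();
        return 0;
    }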

  6. Scheduling Restrictions
     - Gang scheduling
       - Schedules all processes at the same time
       - Preferred by HPC applications with extensive synchronization
     - MapReduce scheduling
       - Per-task scheduling: schedule each task as early as possible
       - Compatible with the detect-restart fault tolerance model
     - Resizing a running job
       - Many platforms do not support it
       - Large overhead (re-queueing)
     - The detect-restart fault tolerance model is not compatible with HPC schedulers

  7. Overall Design
     - Fault-tolerant MapReduce using MPI
       - Reliable failure detection and propagation
       - Fault tolerance models compatible with HPC schedulers
     - FT-MRMPI components: Task Runner, Distributed Master & Load Balancer, Failure Handler
       - Traceable job interfaces
       - Two fault tolerance models: Checkpoint-Restart and Detect-Resume
     - (Diagram: each MapReduce process runs a Task Runner, Distributed Master, Load Balancer, and Failure Handler on top of MPI)

  8. Task Runner
     - Tracing to establish consistent states
     - Delegates operations to the library (the user program calls map(); MR-MPI reads each record, calls (*func)(), processes it, and writes the resulting KV pair)
     - New interfaces: highly extensible, with embedded tracing and record-level consistency (a hypothetical reader implementation is sketched after this slide)

       int main(int narg, char** args) {
           MPI_Init(&narg, &args);
           <***snip***>
           mr->map(new WCMapper(), new WCReader(), NULL, NULL);
           mr->collate(NULL);
           mr->reduce(new WCReducer(), NULL, new WCWriter(), NULL);
           <***snip***>
       }

       template <typename K, typename V> class RecordReader
       template <typename K, typename V> class RecordWriter
       template <typename K, typename V> class Mapper
       template <typename K, typename V> class Reducer
       class WordRecordReader : public RecordReader<int, string>

       void Mapper::map(int& key, string& value, BaseRecordWriter* out, void* param) {
           out->add(value, 1);
       }
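
The slide only names the new interfaces; below is a minimal sketch of what a record reader in that style might look like. The method name (next) and the use of a line offset as the record key are assumptions for illustration, not FT-MRMPI's actual API.

    #include <fstream>
    #include <string>

    // Assumed shape of the RecordReader<K, V> interface listed on the slide.
    template <typename K, typename V>
    class RecordReader {
    public:
        virtual ~RecordReader() {}
        virtual bool next(K &key, V &value) = 0;   // false at end of the input split
    };

    // Line-oriented reader for word count: key = line number, value = line text.
    class WordRecordReader : public RecordReader<int, std::string> {
    public:
        explicit WordRecordReader(const std::string &path) : in(path), lineno(0) {}
        bool next(int &key, std::string &value) override {
            if (!std::getline(in, value)) return false;
            key = lineno++;   // the record position doubles as a checkpointable offset
            return true;
        }
    private:
        std::ifstream in;
        int lineno;
    };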

  9. Distributed Master & Load Balancer
     - Task dispatching: the job is initialized into a global task pool shared by the distributed masters
     - Recovery: globally consistent state
     - Tracing: shuffle buffer
     - Load balancing: monitors the processing speed of tasks, driven by a linear job performance model
     - (Diagram: the MapReduce job is split into a global task pool; each MapReduce process runs a Task Runner, Distributed Master, Load Balancer, and Failure Handler)

  10. Fault Tolerance Model: Checkpoint-Restart
     - Custom error handler: save state and exit gracefully; propagate the failure event with MPI_Abort() (a sketch follows this slide)
     - Checkpoints: asynchronous within each phase; saved locally; multiple granularities
     - Restart to recover: resubmit with -recover and pick up from where the job left off
     - (Diagram: failed, normal, and other processes handling a failure across the map, shuffle, and reduce phases)
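
A minimal sketch of the error path described on this slide: the error handler saves local state and then propagates the failure with MPI_Abort() so the whole job terminates and can be resubmitted with -recover. save_local_checkpoint() is a hypothetical stand-in for FT-MRMPI's real checkpoint writer.

    #include <mpi.h>

    void save_local_checkpoint();   // assumed: flush the task's progress to local disk

    // Installed with MPI_Comm_set_errhandler (see the earlier sketch): save
    // state, exit gracefully, and propagate the failure event to the job.
    static void ckpt_error_handler(MPI_Comm *comm, int *errcode, ...) {
        save_local_checkpoint();
        MPI_Abort(*comm, *errcode);
    }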

  11. Where to Write Checkpoints
     - Write to GPFS: performance issues due to small I/O; interference on shared hardware
     - Write to node-local disk: fast, no interference; but is the checkpoint globally available during recovery?
     - Background data copier: write locally, sync to GPFS in the background, overlapping I/O with computation (a sketch follows this slide)
     - (Chart: job completion time in seconds for GPFS, local, and copier checkpointing; Wordcount 100GB, 256 procs, ppn=8)
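
A minimal sketch of the background data copier idea, under assumed paths and function names: the checkpoint is written to node-local disk for speed, then copied to GPFS asynchronously so the I/O overlaps with computation.

    #include <filesystem>
    #include <future>
    #include <string>

    // Copy a locally written checkpoint file to the shared file system in the
    // background; the caller keeps computing and only waits on the future when
    // the checkpoint must be globally visible.
    std::future<void> sync_checkpoint_to_gpfs(const std::string &local_path,
                                              const std::string &gpfs_path) {
        return std::async(std::launch::async, [local_path, gpfs_path] {
            std::filesystem::copy_file(
                local_path, gpfs_path,
                std::filesystem::copy_options::overwrite_existing);
        });
    }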

  12. Recover Point
     - Recover to the last file (ft-file): less frequent checkpoints, but some work is lost and must be reprocessed on recovery
     - Recover to the last record (ft-rec): requires fine-grained checkpoints; skips records rather than reprocessing them (see the sketch after this slide)
     - (Charts: job recover time in seconds, broken into init, recover, runtime, and skip/reprocess, for ft-rec vs. ft-file on wordcount and pagerank)
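
A minimal sketch of the ft-rec idea, reusing the hypothetical WordRecordReader from the slide 8 sketch and the slide's BaseRecordWriter: on restart, records up to the last checkpointed offset are skipped rather than reprocessed. last_done_offset is assumed to be read back from the checkpoint.

    // Skip records already covered by the checkpoint, then continue mapping.
    void replay_split(WordRecordReader &reader, BaseRecordWriter *out,
                      int last_done_offset) {
        int key;
        std::string value;
        while (reader.next(key, value)) {
            if (key <= last_done_offset) continue;   // skipping is cheaper than reprocessing
            out->add(value, 1);                      // word-count style map step
        }
    }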

  13. Drawbacks of Checkpoint-Restart
     - Checkpoint-Restart works, but it is not perfect:
       - Large overhead from reading and writing checkpoints
       - Requires human intervention
       - A failure can also occur during recovery

  14. Fault Tolerance Model: Detect-Resume
     - Detect: global knowledge of the failure; identify failed processes by comparing groups
     - Resume: fix the COMM by excluding failed processes; balanced redistribution of the affected tasks (work-conserving vs. non-work-conserving)
     - Built on User Level Failure Mitigation (ULFM): MPIX_Comm_revoke(), MPIX_Comm_shrink() (a sketch follows this slide)
     - (Diagram: failed, normal, and other processes handling a failure across the map, shuffle, and reduce phases)
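
A minimal sketch of the detect-resume path using the ULFM calls named on the slide; the group comparison is one way to realize the "identify failed processes by comparing groups" step. It assumes an ULFM-enabled MPI that provides the MPIX_* extensions, and the task redistribution itself is left to the load balancer.

    #include <mpi.h>
    #include <mpi-ext.h>   // ULFM extensions (MPIX_*), assuming an ULFM-enabled MPI

    void resume_after_failure(MPI_Comm &comm) {
        MPIX_Comm_revoke(comm);            // make the failure globally known
        MPI_Comm newcomm;
        MPIX_Comm_shrink(comm, &newcomm);  // fix COMM by excluding failed processes

        // Compare the old and new groups to find which ranks were lost.
        MPI_Group old_grp, new_grp, failed_grp;
        MPI_Comm_group(comm, &old_grp);
        MPI_Comm_group(newcomm, &new_grp);
        MPI_Group_difference(old_grp, new_grp, &failed_grp);

        int nfailed;
        MPI_Group_size(failed_grp, &nfailed);
        // ... hand the failed ranks' tasks to the load balancer for redistribution ...

        MPI_Group_free(&old_grp);
        MPI_Group_free(&new_grp);
        MPI_Group_free(&failed_grp);
        MPI_Comm_free(&comm);
        comm = newcomm;                    // resume on the repaired communicator
    }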

  15. Evaluation Setup
     - LCRC Fusion cluster [1]: 256 nodes
       - CPU: 2-way 8-core Intel Xeon X5550
       - Memory: 36 GB
       - Local disk: 250 GB
       - Network: Mellanox InfiniBand QDR
     - Benchmarks: Wordcount, BFS, Pagerank, mrmpiBLAST
     [1] http://www.lcrc.anl.gov

  16. Job Performance
     - 10%-13% checkpointing overhead
     - Up to 39% shorter completion time when a failure occurs
     - (Charts: job completion time in seconds vs. number of processes, 32-2048, for MR-MPI, checkpoint-restart, detect-resume (WC), and detect-resume (NWC))

  17. Checkpoint Overhead
     - Factors: granularity (number of records per checkpoint) and record size; a sketch of the granularity trade-off follows this slide
     - (Chart: checkpoint overhead (%) vs. number of records per checkpoint, from 1 to 10,000,000)
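
To make the granularity factor concrete, here is a minimal sketch, reusing the hypothetical WordRecordReader and BaseRecordWriter names from the earlier sketches, of checkpointing every N records: a small interval raises the overhead measured in the chart, while a large interval increases the work that must be skipped or redone after a failure. write_progress_checkpoint() is an assumed helper.

    void write_progress_checkpoint(int last_offset);   // assumed checkpoint writer

    // Process one input split, recording progress every ckpt_interval records.
    void run_map_task(WordRecordReader &reader, BaseRecordWriter *out,
                      int ckpt_interval) {
        int key, processed = 0;
        std::string value;
        while (reader.next(key, value)) {
            out->add(value, 1);                       // word-count style map step
            if (++processed % ckpt_interval == 0)
                write_progress_checkpoint(key);       // persist the last completed offset
        }
    }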

  18. Time Decomposition
     - Performance with failure and recovery (Wordcount, all processes together)
     - Detect-Resume has less data that needs to be recovered
     - (Charts: execution time percentage broken into map, recover, aggregate, convert, and reduce for Checkpoint-Restart and Detect-Resume at 64-2048 processes)

  19. Continuous Failures
     - Pagerank, 256 processes; randomly kill 1 process every 5 seconds
     - (Chart: job completion time in seconds for Work-Conserving, Non-Work-Conserving, and Reference; x-axis values 2-128)

  20. Conclusion
     - First fault-tolerant MapReduce implementation in MPI
       - Redesigned MR-MPI to provide fault tolerance
       - Highly extensible while providing the essential features for fault tolerance
     - Two fault tolerance models: Checkpoint-Restart and Detect-Resume

  21. Thank you! Q & A

  22. Backup Slides

  23. Prefetching Data Copier
     - Recovering from GPFS: reading everything from GPFS makes processes wait for I/O
     - Prefetching in recovery: move checkpoint data from GPFS to local disk ahead of time, overlapping I/O with computation

  24. 2-Pass KV-KMV Conversion
     - MR-MPI's 4-pass conversion: excessive disk I/O during shuffle; the hash table of keys makes it hard to take checkpoints
     - 2-pass conversion: log-structured file layout; KV -> sketch, then sketch -> KMV

  25. Recover Time
     - Recovery from local disk, GPFS, and GPFS with prefetching
     - (Chart: recover time in seconds at 64-2048 processes for local, GPFS, and GPFS with prefetching)
