User-Level Process Towards Exascale Systems

Akio Shimada [1], Atsushi Hori [1], Yutaka Ishikawa [1], Pavan Balaji [2]

[1] RIKEN AICS, [2] Argonne National Laboratory

Background

• MPI processes running on an HPC cluster communicate with each other to exchange data for parallel computation
  – An MPI process must wait for the completion of a communication
• Latency hiding is an important issue on the way to exascale systems
  – The network of an HPC cluster will become larger

Methods for Latency Hiding

• Non-blocking communication
  – Overlapping communication and computation
• Oversubscription
  – Binding multiple processes to one CPU core
  – Switching to another process when the running process blocks waiting for a communication to complete

Problem

• Process context switch is slow
  – The overhead of the process context switch cancels the benefit of process oversubscription in some cases [Iancu et al., IPDPS 2010]
    • The overhead of entering the kernel context
    • The overhead of address space switching

Conventional Approach

• Oversubscription using user-level threads (e.g. FG-MPI)
  – Invoking multiple user-level threads within a process
  – Assigning the role of an MPI process to each user-level thread
• Pros and cons
  – Pros
    • Fast context switch
      – The context switch between user-level threads can be performed in user space
      – The context switch between user-level threads does not require address space switching
  – Cons
    • Modification to the application is required
      – Program code (text) and data (data, bss and heap) are shared among the user-level threads that play the role of MPI processes
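The drawback listed above can be illustrated with a minimal sketch, assuming glibc's ucontext functions as the user-level threading mechanism (FG-MPI itself is not used here). The names my_rank, rank_main and run_shared_global_demo are hypothetical. Two user-level threads each store their rank in a global variable, as an unmodified MPI program might; because data and bss are shared, the first thread sees the second thread's value after it resumes.

```c
#include <assert.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

/* A global that an unmodified MPI program would treat as private to its rank. */
static int my_rank = -1;
static int observed[2];

static ucontext_t main_ctx, ult_ctx[2];
static char stacks[2][STACK_SIZE];

/* Body of one user-level thread standing in for an MPI process (hypothetical). */
static void rank_main(int rank)
{
    assert(rank == 0 || rank == 1);
    my_rank = rank;                          /* "private" initialisation */
    swapcontext(&ult_ctx[rank], &main_ctx);  /* yield, as if blocked on communication */
    observed[rank] = my_rank;                /* value seen after being resumed */
}

/* Runs two user-level threads in one address space.
 * Returns the rank value that thread 0 observes after thread 1 has run. */
int run_shared_global_demo(void)
{
    for (int i = 0; i < 2; i++) {
        if (getcontext(&ult_ctx[i]) != 0)
            return -1;
        ult_ctx[i].uc_stack.ss_sp = stacks[i];
        ult_ctx[i].uc_stack.ss_size = STACK_SIZE;
        ult_ctx[i].uc_link = &main_ctx;      /* resume the caller when the body returns */
        makecontext(&ult_ctx[i], (void (*)(void))rank_main, 1, i);
    }
    swapcontext(&main_ctx, &ult_ctx[0]);     /* thread 0 stores 0, then yields */
    swapcontext(&main_ctx, &ult_ctx[1]);     /* thread 1 stores 1, then yields */
    swapcontext(&main_ctx, &ult_ctx[0]);     /* thread 0 resumes and reads the global */
    swapcontext(&main_ctx, &ult_ctx[1]);     /* thread 1 resumes and reads the global */
    return observed[0];
}
```

Thread 0 reads 1 rather than 0, which is why globals have to be privatized by hand when MPI processes are mapped to user-level threads. Note that glibc's swapcontext also saves and restores the signal mask, so it is only an approximation of a pure user-space switch.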

Our Solution

• User-level process (ULP)
  – A ULP is a "process" that can be scheduled in user space
    • The ULP has the beneficial features of the user-level thread
    • The ULP has its own program code and data (therefore, we equate the ULP with a "process")
  – Capability of ULP
    • The ULP enables low-overhead process oversubscription
    • Modification to the application is not required

                                  Kernel-level Process   User-level Thread   User-level Process
  Context switch                  Slow                   Fast                Fast
  Modification to the application Not required           Required            Not required

Overview of User-level Process

[Figure: execution contexts, address space boundaries and task schedulers on a CPU core for (a) kernel-level processes, (b) user-level processes, (c) kernel-level threads and (d) user-level threads. Kernel-level processes and kernel-level threads are scheduled by a task scheduler in kernel space; user-level processes and user-level threads are scheduled by a task scheduler in user space inside a kernel-level process. Each kernel-level or user-level process has its own text, data, bss, heap and stack; threads share text, data, bss and heap and have separate stacks.]

• The ULP can be scheduled in user space
  – Low-overhead oversubscription can be achieved by avoiding the overhead of the process context switch
• The ULP has its own program code and data
  – Modification to the application is not required

Address Space Design

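[Figure: address space layout, from low to high addresses, for a process, a user-level thread and a user-level process.
  Process: TEXT, DATA&BSS, HEAP, STACK, KERNEL.
  User-level thread: TEXT, DATA&BSS, HEAP, STACK 0, STACK 1, STACK 2, ..., STACK N-1, KERNEL.
  User-level process: one partition per ULP (ULP 0, ULP 1, ULP 2, ..., ULP N-1), each holding its own TEXT, DATA&BSS, HEAP and STACK, followed by KERNEL.]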
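The partitioned layout can be sketched as simple address arithmetic. This is only an illustration under assumptions the slides do not state: that partitions are contiguous, equally sized, and indexed by ULP ID. The constants ULP_BASE and ULP_PART_SIZE and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout parameters, not taken from the slides. */
#define ULP_BASE      UINT64_C(0x100000000000)  /* start of the partition for ULP 0 */
#define ULP_PART_SIZE UINT64_C(0x001000000000)  /* 64 GiB of address space per ULP */

/* Lowest address of the partition assumed to hold ULP `pvid`. */
uint64_t ulp_partition_base(int pvid)
{
    assert(pvid >= 0);
    return ULP_BASE + (uint64_t)pvid * ULP_PART_SIZE;
}

/* ID of the ULP whose partition contains `addr`, or -1 if no partition does. */
int ulp_owner(uint64_t addr, int nulps)
{
    if (addr < ULP_BASE)
        return -1;
    uint64_t idx = (addr - ULP_BASE) / ULP_PART_SIZE;
    return idx < (uint64_t)nulps ? (int)idx : -1;
}
```

Under this assumed layout every ULP keeps its text, data, bss, heap and stack inside its own partition, so switching between ULPs does not require switching the address space.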

Context Switch

[Figure: context switch from ULP 0 to ULP 1 on a CPU core. ① Save the context (registers) of user-level process 0. ② Load the context (registers) of user-level process 1. The partition for ULP 0 and the partition for ULP 1 each hold text, data & bss, heap and stack, laid out from low to high addresses.]

• Segment registers must be considered on x86_64 architectures
  – Segment registers are not accessible from user space
  – The fs register is used for implementing Thread Local Storage (TLS)
  – Thread-safe functions must be built without using TLS
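Why TLS needs care can be shown with a small sketch, assuming glibc's ucontext functions as a stand-in for a user-space context switch (this is not the ULP implementation). The names tls_counter, bump and run_tls_demo are hypothetical. A user-space switch of this kind leaves the fs base untouched, so two contexts running on the same kernel-level thread resolve a thread-local variable to the same storage.

```c
#include <assert.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

/* Thread-local variable; on x86_64 Linux it is addressed relative to fs. */
static _Thread_local int tls_counter;

static ucontext_t main_ctx, ctx[2];
static char stacks[2][STACK_SIZE];

/* Each user-level context increments what it believes is its own TLS copy. */
static void bump(void)
{
    tls_counter++;
}

/* Runs two user-level contexts on the calling kernel-level thread and
 * returns the final value of the caller's thread-local counter. */
int run_tls_demo(void)
{
    tls_counter = 0;
    for (int i = 0; i < 2; i++) {
        if (getcontext(&ctx[i]) != 0)
            return -1;
        ctx[i].uc_stack.ss_sp = stacks[i];
        ctx[i].uc_stack.ss_size = STACK_SIZE;
        ctx[i].uc_link = &main_ctx;          /* return here when bump() finishes */
        makecontext(&ctx[i], bump, 0);
        if (swapcontext(&main_ctx, &ctx[i]) != 0)
            return -1;
    }
    return tls_counter;                      /* both increments landed in one variable */
}
```

The function returns 2: both contexts updated the same thread-local variable. For each ULP to have private TLS, the fs base would have to change on every switch, which is the cost the slide refers to.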

ULP API

• int pvas_ulp_create(int *pvd)
  – pvas_ulp_create creates an address space for ULPs
• int pvas_ulp_destroy(int pvd)
  – pvas_ulp_destroy destroys a created address space
• int pvas_ulp_spawn(int pvd, int pvid, char *filename, char **argv, char **environ)
  – pvas_ulp_spawn spawns a kernel-level process with a ULP
• int pvas_ulp_exec(int pvid, char *filename, char **argv, char **environ)
  – pvas_ulp_exec creates and executes a new ULP
• int pvas_ulp_switch(int pvid)
  – pvas_ulp_switch switches context from the current ULP to the indicated ULP
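A possible call sequence for the functions above is sketched below. The prototypes are the ones listed on the slide; everything else is an assumption: the stub bodies are stand-ins so the sketch compiles without the PVAS library, a return value of 0 is assumed to mean success, and launch_one_ulp is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

static int calls;  /* number of stub invocations, for illustration only */

/* Stand-in stubs, NOT the real PVAS implementation. Assumed: 0 means success. */
int pvas_ulp_create(int *pvd)
{
    if (pvd == NULL)
        return -1;
    *pvd = 0;                  /* pretend descriptor of the new address space */
    calls++;
    return 0;
}

int pvas_ulp_destroy(int pvd)
{
    (void)pvd;
    calls++;
    return 0;
}

int pvas_ulp_spawn(int pvd, int pvid, char *filename, char **argv, char **environ)
{
    (void)pvd; (void)pvid; (void)argv; (void)environ;
    if (filename == NULL)
        return -1;
    calls++;
    return 0;
}

/* Hypothetical helper: create an address space, spawn ULP 0 from `filename`,
 * then destroy the address space. Returns 0 on success, -1 otherwise. */
int launch_one_ulp(char *filename, char **argv, char **envp)
{
    int pvd;
    if (pvas_ulp_create(&pvd) != 0)
        return -1;
    int rc = pvas_ulp_spawn(pvd, 0, filename, argv, envp);
    /* A real program would wait for the ULP to finish before this point;
     * the slide does not list a call for that. */
    if (pvas_ulp_destroy(pvd) != 0)
        return -1;
    return rc;
}
```

According to the slide, pvas_ulp_exec and pvas_ulp_switch take no address space descriptor, which suggests they are meant to be called from inside a running ULP; that reading is an inference and is not shown here.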

Preliminary Evaluation (context switch performance)

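• Benchmark
  – Invoking multiple parallel processes on a single CPU core
  – A parallel process may be a kernel-level process, a kernel-level thread, a user-level thread or a user-level process
  – Measuring the time elapsed until every parallel process has performed a context switch 1000 times
• The performance of the ULP is competitive with that of the user-level thread

[Figure: elapsed time in seconds (0 to 1.4, lower is better) versus number of parallel processes (100 to 1000) for kernel-level process, kernel-level thread, user-level thread (massivethreads), user-level process, and user-level process (ignoring fs).]

Environment
  CPU: Intel Xeon X5670, 2.93 GHz
  OS: Linux 2.6.32-el6 for x86_64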
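The shape of the benchmark can be approximated for the user-level thread case with the sketch below. It assumes glibc's ucontext functions and a round-robin scheduler rather than the massivethreads library used in the evaluation, so absolute numbers will differ. The names task_body and run_switch_benchmark are hypothetical.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <time.h>
#include <ucontext.h>

#define STACK_SIZE (32 * 1024)
#define SWITCHES_PER_TASK 1000

static ucontext_t sched_ctx;
static ucontext_t *task_ctx;
static long switches_done;

/* One "parallel process": yields to the scheduler 1000 times, then ends. */
static void task_body(int id)
{
    for (int i = 0; i < SWITCHES_PER_TASK; i++) {
        switches_done++;
        swapcontext(&task_ctx[id], &sched_ctx);
    }
}

/* Runs `ntasks` user-level tasks round-robin on the calling thread.
 * Returns elapsed seconds (or -1.0 on failure) and stores the yield count. */
double run_switch_benchmark(int ntasks, long *total)
{
    if (ntasks <= 0)
        return -1.0;
    task_ctx = calloc((size_t)ntasks, sizeof *task_ctx);
    char *stacks = malloc((size_t)ntasks * STACK_SIZE);
    if (task_ctx == NULL || stacks == NULL) {
        free(task_ctx);
        free(stacks);
        return -1.0;
    }
    switches_done = 0;
    for (int id = 0; id < ntasks; id++) {
        getcontext(&task_ctx[id]);
        task_ctx[id].uc_stack.ss_sp = stacks + (size_t)id * STACK_SIZE;
        task_ctx[id].uc_stack.ss_size = STACK_SIZE;
        task_ctx[id].uc_link = &sched_ctx;   /* back to the scheduler at task end */
        makecontext(&task_ctx[id], (void (*)(void))task_body, 1, id);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* Each task is resumed 1001 times: 1000 yields plus the final return. */
    for (int round = 0; round <= SWITCHES_PER_TASK; round++)
        for (int id = 0; id < ntasks; id++)
            swapcontext(&sched_ctx, &task_ctx[id]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *total = switches_done;
    free(stacks);
    free(task_ctx);
    task_ctx = NULL;
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling run_switch_benchmark with task counts from 100 to 1000 gives one curve comparable in shape to the user-level thread line; the kernel-level process and kernel-level thread cases would need fork or pthreads with sched_yield instead.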

Summary and Future Work

• Summary
  – The ULP enables low-overhead oversubscription by avoiding the overhead of the process context switch
  – Oversubscription using ULPs does not require any modification to the application
• Future work
  – Embed the capability of the ULP in MPI runtimes and evaluate it


User-Level Process towards Exascale Systems discusses methods for latency hiding in MPI processes running on HPC clusters. It explores issues related to communication latency, oversubscription, and process context switching. The solution proposed is the use of User-Level Processes (ULP) as a low-overhead approach for process oversubscription without requiring application modifications.


