User-Level Process Towards Exascale Systems

Akio Shimada [1], Atsushi Hori [1], Yutaka Ishikawa [1], Pavan Balaji [2]
[1] RIKEN AICS, [2] Argonne National Laboratory
Background
MPI processes running on an HPC cluster communicate with each other to exchange data for parallel computation
An MPI process must wait for each communication to complete
Latency hiding is an important issue on the way to exascale systems
  The network of an HPC cluster will grow larger
Methods for Latency Hiding
Non-blocking communication
  Overlapping communication and computation
Oversubscription
  Binding multiple processes to one CPU core
  Switching to another process when a process blocks waiting for a communication to complete
Problem
Process context switching is slow
  The overhead of the process context switch cancels the benefit of process oversubscription in some cases [Iancu et al., IPDPS 2010]
    The overhead of entering the kernel context
    The overhead of switching address spaces
Conventional Approach
Oversubscription using user-level threads (e.g. FG-MPI)
  Invoking multiple user-level threads within a process
  Assigning the role of an MPI process to each user-level thread
Pros and cons
  Pros
    Fast context switch
      Context switches between user-level threads can be performed in user space
      Context switches between user-level threads do not require address space switching
  Cons
    Modification to the application is required
      Program code (text) and data (data, bss and heap) are shared among the user-level threads that play the role of MPI processes
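The "Cons" point can be made concrete with a small sketch. It assumes the POSIX `ucontext` routines as a stand-in for a user-level thread library (this is not how FG-MPI is implemented), and the names `my_rank`, `rank_main` and `run_two_ranks` are hypothetical. Two user-level threads each store their rank in a global variable, yield, and then read it back. Because data is shared, both read the value written last.

```c
#include <assert.h>
#include <ucontext.h>

#define RANK_STACK_SIZE (64 * 1024)

static ucontext_t sched_ctx;            /* context of the user-space scheduler */
static ucontext_t rank_ctx[2];          /* one context per user-level thread   */
static char rank_stack[2][RANK_STACK_SIZE];

static int my_rank;                     /* a global an MPI program treats as per-process */
static int observed[2];                 /* what each thread reads back after yielding    */

/* Body of one user-level thread that plays the role of an MPI process. */
static void rank_main(int rank)
{
    my_rank = rank;                             /* "private" state, in fact shared  */
    swapcontext(&rank_ctx[rank], &sched_ctx);   /* yield to the scheduler           */
    observed[rank] = my_rank;                   /* may have been overwritten        */
}

/* Run both threads: first up to their yield, then to completion. */
static void run_two_ranks(void)
{
    for (int r = 0; r < 2; r++) {
        int rc = getcontext(&rank_ctx[r]);
        assert(rc == 0);
        rank_ctx[r].uc_stack.ss_sp = rank_stack[r];
        rank_ctx[r].uc_stack.ss_size = RANK_STACK_SIZE;
        rank_ctx[r].uc_link = &sched_ctx;       /* resume scheduler when thread returns */
        makecontext(&rank_ctx[r], (void (*)(void))rank_main, 1, r);
    }
    for (int pass = 0; pass < 2; pass++)
        for (int r = 0; r < 2; r++)
            swapcontext(&sched_ctx, &rank_ctx[r]);
}
```

After `run_two_ranks()`, both entries of `observed` hold 1, the value written by the second thread. An application written for separate MPI processes would have to be modified, for example by privatizing such globals, before it could run this way.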
Our Solution
User-level process (ULP)
  A ULP is a “process” that can be scheduled in user space
    The ULP has the beneficial features of the user-level thread
    The ULP has its own program code and data (therefore, we regard the ULP as a “process”)
  Capability of the ULP
    The ULP enables low-overhead process oversubscription
    Modification to the application is not required
Overview of User-level Process
[Figure: four panels, (a) Kernel-level Process, (b) User-level Process, (c) Kernel-level Thread and (d) User-level Thread, showing for each the text, data, bss, heap and stack segments, the execution contexts on a CPU core, the address space boundaries, and whether the task scheduler runs in kernel space or user space]
The ULP can be scheduled in user space
  Low-overhead oversubscription can be achieved by avoiding the overhead of the process context switch
The ULP has its own program code and data
  Modification to the application is not required
Address Space Design
[Figure: address space layouts, from low to high addresses, for a process, user-level threads and user-level processes. Process: TEXT, DATA&BSS, HEAP, STACK, KERNEL. User-level threads: TEXT, DATA&BSS, HEAP, STACK 0 to STACK N-1, KERNEL. User-level processes: regions for ULP 0 to ULP N-1, followed by KERNEL]
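The design places every ULP in its own partition of a single address space. A minimal sketch of the resulting address arithmetic follows, assuming equally sized partitions. The base address, the partition size and the function names are hypothetical, since the slides give no concrete values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout parameters; the slides do not state real values. */
#define ULP_REGION_BASE  (UINT64_C(1) << 44)   /* assumed start of the partition of ULP 0 */
#define ULP_PARTITION_SZ (UINT64_C(1) << 32)   /* assumed 4 GiB per ULP                   */

/* Lowest address of the partition that holds the text, data & bss,
   heap and stack of the ULP with ID pvid. */
static uint64_t ulp_partition_base(int pvid)
{
    assert(pvid >= 0);
    return ULP_REGION_BASE + (uint64_t)pvid * ULP_PARTITION_SZ;
}

/* ID of the ULP whose partition contains addr, or -1 if addr lies
   outside the partitions of ULP 0 .. num_ulps-1. */
static int ulp_owner_of(uint64_t addr, int num_ulps)
{
    if (addr < ULP_REGION_BASE)
        return -1;
    uint64_t idx = (addr - ULP_REGION_BASE) / ULP_PARTITION_SZ;
    return idx < (uint64_t)num_ulps ? (int)idx : -1;
}
```

Since all partitions live in one address space, switching between ULPs needs no address space switch. That is the property the design borrows from user-level threads.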
Context Switch
Context switch from ULP 0 to ULP 1
[Figure: one address space, from low to high addresses, holding a partition for ULP 0 and a partition for ULP 1, each with text, data & bss, heap and stack; the CPU core ① saves the register context of user-level process 0 and ② loads the register context of user-level process 1]
Segment registers must be considered on x86_64 architectures
  Segment registers are not accessible from user space
  The fs register is used to implement Thread Local Storage (TLS)
  Thread-safe functions must be built without using TLS
ULP API
int pvas_ulp_create(int *pvd)
  pvas_ulp_create creates an address space for ULPs
int pvas_ulp_destroy(int pvd)
  pvas_ulp_destroy destroys a created address space
int pvas_ulp_spawn(int pvd, int pvid, char *filename, char **argv, char **environ)
  pvas_ulp_spawn spawns a kernel-level process with a ULP
int pvas_ulp_exec(int pvid, char *filename, char **argv, char **environ)
  pvas_ulp_exec creates and executes a new ULP
int pvas_ulp_switch(int pvid)
  pvas_ulp_switch switches context from the current ULP to the indicated ULP
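One way `pvas_ulp_switch` might be used for oversubscription is a cooperative yield at the point where an MPI process would otherwise block. The sketch below is only an illustration under several assumptions. The `pvas_ulp_switch` defined here is a stand-in that records its argument so that the code compiles without the PVAS library. The return value 0 for success, the round-robin policy and the helpers `next_pvid` and `ulp_yield` are all hypothetical.

```c
#include <assert.h>

/* Stand-in for the real pvas_ulp_switch (assumption: returns 0 on success).
   It only records the requested target; remove it when linking the real library. */
static int last_switch_target = -1;

static int pvas_ulp_switch(int pvid)
{
    last_switch_target = pvid;
    return 0;
}

/* Hypothetical scheduling policy: pick the next ULP in round-robin order. */
static int next_pvid(int self, int num_ulps)
{
    assert(num_ulps > 0 && self >= 0 && self < num_ulps);
    return (self + 1) % num_ulps;
}

/* Hypothetical helper: called by ULP `self` where it would otherwise block
   waiting for a communication to complete. */
static int ulp_yield(int self, int num_ulps)
{
    return pvas_ulp_switch(next_pvid(self, num_ulps));
}
```

With four ULPs bound to one core, `ulp_yield(3, 4)` hands the core to ULP 0. A real scheduler would presumably also check whether the target ULP has work it can do.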
Preliminary Evaluation (context switch performance)
Benchmark
  Invoking multiple parallel processes on a single CPU core
    A parallel process may be a kernel-level process, a kernel-level thread, a user-level thread or a user-level process
  Measuring the time elapsed until every parallel process has performed 1000 context switches
The performance of the ULP is competitive with that of the user-level thread
Environment
  CPU: Intel Xeon X5670, 2.93 GHz
  OS: Linux 2.6.32-el6 for x86_64
[Figure: elapsed time (sec) against the number of parallel processes (100 to 1000) for kernel-level process, kernel-level thread, user-level thread (MassiveThreads), user-level process, and user-level process (ignoring fs); lower is better]
Summary and Future Work
Summary
  The ULP enables low-overhead oversubscription by avoiding the overhead of the process context switch
  Oversubscription using ULPs does not require any modification to the application
Future work
  Embedding the capability of the ULP in MPI runtimes and evaluating it

User-Level Process towards Exascale Systems discusses methods for latency hiding in MPI processes running on HPC clusters. It explores issues related to communication latency, oversubscription, and process context switching. The solution proposed is the use of User-Level Processes (ULP) as a low-overhead approach for process oversubscription without requiring application modifications.

  • Exascale Systems
  • Latency Hiding
  • User-Level Processes
  • MPI Processes
  • HPC Clusters

Uploaded on Feb 21, 2025


Presentation Transcript


  1. User-Level Process towards Exascale Systems Akio Shimada[1], Atsushi Hori[1], Yutaka Ishikawa[1], Pavan Balaji[2] [1]RIKEN AICS, [2]Argonne National Laboratory

  2. Background: MPI processes running on an HPC cluster communicate with each other to exchange data for parallel computation. An MPI process must wait for each communication to complete. Latency hiding is an important issue on the way to exascale systems, as the network of an HPC cluster will grow larger.

  3. Methods for Latency Hiding: Non-blocking communication (overlapping communication and computation). Oversubscription (binding multiple processes to one CPU core and switching to another process when a process blocks waiting for a communication to complete).

  4. Problem: Process context switching is slow. Its overhead cancels the benefit of process oversubscription in some cases [Iancu et al., IPDPS 2010]. The overhead comes from entering the kernel context and from switching address spaces.

  5. Conventional Approach: Oversubscription using user-level threads (e.g. FG-MPI) invokes multiple user-level threads within a process and assigns the role of an MPI process to each user-level thread. Pros: fast context switch, because context switches between user-level threads are performed in user space and do not require address space switching. Cons: modification to the application is required, because program code (text) and data (data, bss and heap) are shared among the user-level threads that play the role of MPI processes.

  6. Our Solution: User-level process (ULP). A ULP is a “process” that can be scheduled in user space. The ULP has the beneficial features of the user-level thread. The ULP has its own program code and data (therefore, we regard the ULP as a “process”). Capability of the ULP: it enables low-overhead process oversubscription, and modification to the application is not required.

                                          Kernel-level Process   User-level Thread   User-level Process
     Context switch                       Slow                   Fast                Fast
     Modification to the application      Not required           Required            Not required

  7. Overview of User-level Process: [Figure: four panels, (a) Kernel-level Process, (b) User-level Process, (c) Kernel-level Thread and (d) User-level Thread, showing for each the text, data, bss, heap and stack segments, the execution contexts on a CPU core, the address space boundaries, and whether the task scheduler runs in kernel space or user space.] The ULP can be scheduled in user space, so low-overhead oversubscription can be achieved by avoiding the overhead of the process context switch. The ULP has its own program code and data, so modification to the application is not required.

  8. Address Space Design: [Figure: address space layouts, from low to high addresses, for a process, user-level threads and user-level processes. Process: TEXT, DATA&BSS, HEAP, STACK, KERNEL. User-level threads: TEXT, DATA&BSS, HEAP, STACK 0 to STACK N-1, KERNEL. User-level processes: regions for ULP 0 to ULP N-1, followed by KERNEL.]

  9. Context Switch: Context switch from ULP 0 to ULP 1. [Figure: one address space, from low to high addresses, holding a partition for ULP 0 and a partition for ULP 1, each with text, data & bss, heap and stack; the CPU core saves the register context of user-level process 0 and loads the register context of user-level process 1.] Segment registers must be considered on x86_64 architectures. Segment registers are not accessible from user space. The fs register is used to implement Thread Local Storage (TLS). Thread-safe functions must be built without using TLS.

  10. ULP API: int pvas_ulp_create(int *pvd): pvas_ulp_create creates an address space for ULPs. int pvas_ulp_destroy(int pvd): pvas_ulp_destroy destroys a created address space. int pvas_ulp_spawn(int pvd, int pvid, char *filename, char **argv, char **environ): pvas_ulp_spawn spawns a kernel-level process with a ULP. int pvas_ulp_exec(int pvid, char *filename, char **argv, char **environ): pvas_ulp_exec creates and executes a new ULP. int pvas_ulp_switch(int pvid): pvas_ulp_switch switches context from the current ULP to the indicated ULP.

  11. Preliminary Evaluation (context switch performance): Environment: CPU: Intel Xeon X5670, 2.93 GHz; OS: Linux 2.6.32-el6 for x86_64. [Figure: elapsed time (sec) against the number of parallel processes (100 to 1000) for kernel-level process, kernel-level thread, user-level thread (MassiveThreads), user-level process, and user-level process (ignoring fs); lower is better.] Benchmark: Invoking multiple parallel processes on a single CPU core. A parallel process may be a kernel-level process, a kernel-level thread, a user-level thread or a user-level process. Measuring the time elapsed until every parallel process has performed 1000 context switches. The performance of the ULP is competitive with that of the user-level thread.

  12. Summary and Future Work: Summary: The ULP enables low-overhead oversubscription by avoiding the overhead of the process context switch. Oversubscription using ULPs does not require any modification to the application. Future work: Embedding the capability of the ULP in MPI runtimes and evaluating it.
