User-Level Process Towards Exascale Systems
User-Level Process towards Exascale Systems discusses methods for latency hiding in MPI processes running on HPC clusters. It explores issues related to communication latency, oversubscription, and process context switching. The solution proposed is the use of User-Level Processes (ULP) as a low-overhead approach for process oversubscription without requiring application modifications.
Presentation Transcript
User-Level Process towards Exascale Systems. Akio Shimada[1], Atsushi Hori[1], Yutaka Ishikawa[1], Pavan Balaji[2]. [1] RIKEN AICS, [2] Argonne National Laboratory.
Background: MPI processes running on an HPC cluster communicate with each other to exchange data for parallel computation, and an MPI process must wait for each communication to complete. Latency hiding can be considered an important issue towards Exascale systems, because the network of an HPC cluster will be larger.
Methods for Latency Hiding: (1) Non-blocking communication, which overlaps communication and computation. (2) Oversubscription, which binds multiple processes to one CPU core and switches to another process when the running process blocks waiting for a communication to complete.
Problem: The process context switch is slow. In some cases the overhead of the process context switch spoils the benefit of process oversubscription [Iancu et al., IPDPS 2010]. The overhead has two sources: jumping into the kernel context and switching the address space.
Conventional Approach: Oversubscription using user-level threads (e.g. FG-MPI), which invokes multiple user-level threads within a process and assigns the role of an MPI process to each user-level thread. Pros: the context switch is fast, because it can be conducted in user space and does not require address space switching. Cons: modification to the application is required, because program code (text) and data (data, bss and heap) are shared among the user-level threads playing the role of MPI processes.
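The data-sharing drawback can be illustrated with a small C sketch. It assumes Linux with glibc and uses the `ucontext` routines as a stand-in for a user-level thread library; it is not FG-MPI code, and the names `my_rank`, `rank_after_user_level_thread` and `rank_after_child_process` are hypothetical. A global variable written by a user-level thread is visible to its creator, whereas the same write in a forked kernel-level process stays private to the child.

```c
#define _GNU_SOURCE            /* assumes Linux/glibc */
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <ucontext.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Hypothetical application global, e.g. a cached MPI rank. */
static int my_rank = -1;
static ucontext_t main_ctx, task_ctx;

/* Body of the user-level thread: overwrites the global. */
static void set_rank_task(void) { my_rank = 1; }

/* The user-level thread shares data/bss with its creator,
 * so its write is visible after control returns. */
int rank_after_user_level_thread(void) {
    char *stack = malloc(STACK_SIZE);
    assert(stack != NULL);
    my_rank = 0;
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = STACK_SIZE;
    task_ctx.uc_link = &main_ctx;      /* resume here when the task returns */
    makecontext(&task_ctx, set_rank_task, 0);
    swapcontext(&main_ctx, &task_ctx);
    free(stack);
    return my_rank;
}

/* A forked process has its own copy of the global,
 * so the child's write does not reach the parent. */
int rank_after_child_process(void) {
    my_rank = 0;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) { my_rank = 1; _exit(0); }
    waitpid(pid, NULL, 0);
    return my_rank;
}
```

Under these assumptions the first function returns 1 and the second returns 0, which is why an application written as separate MPI processes generally needs changes before its ranks can run as user-level threads.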
Our Solution: User-level process (ULP). A ULP is a process that can be scheduled in user space. It has the beneficial features of a user-level thread, yet it has its own program code and data (which is why we equate the ULP with a process). Capability of the ULP: it enables low-overhead process oversubscription, and modification to the application is not required.

                                  Kernel-level Process   User-level Thread   User-level Process
Context switch                    Slow                   Fast                Fast
Modification to the application   Not required           Required            Not required
Overview of User-level Process: [Figure with four panels: (a) Kernel-level Process, (b) User-level Process, (c) Kernel-level Thread, (d) User-level Thread. Each panel shows the text, data, bss, heap and stack regions of the execution contexts, the address space boundaries, the task scheduler (user-space or kernel-space) and the CPU core.] The ULP can be scheduled in user space, so low-overhead oversubscription can be achieved by avoiding the overhead of the process context switch. The ULP has its own program code and data, so modification to the application is not required.
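Address Space Design: [Figure: address space layouts, from low to high addresses, for three cases. Process: TEXT, DATA&BSS, HEAP, STACK, KERNEL. User-level Process: one partition per ULP (ULP 0, ULP 1, ULP 2, ..., ULP N-1), with the partition of ULP 0 shown holding its own TEXT, DATA&BSS, HEAP and STACK, followed by KERNEL. User-level Thread: TEXT, DATA&BSS, HEAP, then STACK 0, STACK 1, STACK 2, ..., STACK N-1, followed by KERNEL.]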
Context Switch: [Figure: context switch from ULP 0 to ULP 1. The CPU core saves its registers as the context of ULP 0 and loads the context of ULP 1. Each ULP occupies its own partition of the address space (text, data & bss, heap, stack), laid out from low to high addresses.] Segment registers must be considered on x86_64 architectures: segment registers are not accessible from user space, and the fs register is used for implementing Thread Local Storage (TLS). Thread-safe functions must be built without using TLS.
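The save-and-load step of a user-space context switch can be sketched with the `ucontext` routines in glibc. This is only an illustration of switching between two execution contexts without a kernel scheduling decision, not the ULP implementation: both contexts below share one address space, and the code does not deal with the fs register. It assumes Linux with glibc; the names `ping_pong`, `task_a` and `task_b` are hypothetical.

```c
#define _GNU_SOURCE            /* assumes Linux/glibc */
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, ctx_a, ctx_b;
static int rounds_left;
static int switch_count;

/* Task A: hands the CPU to task B while rounds remain, then returns,
 * which resumes main_ctx through uc_link. */
static void task_a(void) {
    while (rounds_left > 0) {
        switch_count++;
        swapcontext(&ctx_a, &ctx_b);   /* save A's registers, load B's */
    }
}

/* Task B: consumes one round and hands the CPU back to task A. */
static void task_b(void) {
    for (;;) {
        rounds_left--;
        switch_count++;
        swapcontext(&ctx_b, &ctx_a);   /* save B's registers, load A's */
    }
}

/* Runs `rounds` A->B->A round trips and returns the number of switches
 * performed between the two tasks, or -1 if allocation fails. */
int ping_pong(int rounds) {
    assert(rounds >= 0);
    char *stack_a = malloc(STACK_SIZE);
    char *stack_b = malloc(STACK_SIZE);
    if (!stack_a || !stack_b) { free(stack_a); free(stack_b); return -1; }

    rounds_left = rounds;
    switch_count = 0;

    getcontext(&ctx_a);
    ctx_a.uc_stack.ss_sp = stack_a;
    ctx_a.uc_stack.ss_size = STACK_SIZE;
    ctx_a.uc_link = &main_ctx;
    makecontext(&ctx_a, task_a, 0);

    getcontext(&ctx_b);
    ctx_b.uc_stack.ss_sp = stack_b;
    ctx_b.uc_stack.ss_size = STACK_SIZE;
    ctx_b.uc_link = &main_ctx;
    makecontext(&ctx_b, task_b, 0);

    swapcontext(&main_ctx, &ctx_a);    /* start task A; returns when A ends */

    free(stack_a);
    free(stack_b);
    return switch_count;
}
```

Calling `ping_pong(3)` performs three round trips, i.e. six switches between the two tasks. Note that glibc's `swapcontext` also saves and restores the signal mask, which typically involves a system call, so the sketch shows the control flow of a user-space switch and says little about its cost.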
ULP API:
int pvas_ulp_create(int *pvd): creates an address space for ULPs.
int pvas_ulp_destroy(int pvd): destroys a created address space.
int pvas_ulp_spawn(int pvd, int pvid, char *filename, char **argv, char **environ): spawns a kernel-level process with a ULP.
int pvas_ulp_exec(int pvid, char *filename, char **argv, char **environ): creates and executes a new ULP.
int pvas_ulp_switch(int pvid): conducts a context switch from the current ULP to the indicated ULP.
Preliminary Evaluation (context switch performance): Environment: CPU Intel Xeon X5670 2.93 GHz; OS Linux 2.6.32-el6 for x86_64. Benchmark: multiple parallel processes are invoked on a single CPU core, where a parallel process may be a kernel-level process, a kernel-level thread, a user-level thread or a user-level process; the benchmark measures the time elapsed until every parallel process has performed a context switch 1000 times. [Figure: elapsed time (sec, 0 to 1.4; lower is better) versus number of parallel processes (100 to 1000), with one series each for kernel-level process, kernel-level thread, user-level thread (massivethreads), user-level process, and user-level process (ignoring fs).] Result: the performance of the ULP is competitive with that of the user-level thread.
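A benchmark of this shape can be sketched for the user-level thread case only, again using glibc's `ucontext` routines as a stand-in; it is not the authors' benchmark code and it does not cover the kernel-level or ULP variants. A round-robin scheduler dispatches `ntasks` contexts, each of which yields `nyields` times before finishing, and the elapsed time is taken with `clock_gettime`. The names `run_round_robin` and `task_body` are hypothetical, and Linux with glibc is assumed.

```c
#define _GNU_SOURCE            /* for clock_gettime in strict modes; assumes Linux/glibc */
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#include <ucontext.h>

#define STACK_SIZE (32 * 1024)

static ucontext_t sched_ctx;    /* context of the scheduler loop */
static ucontext_t *task_ctx;    /* one context per simulated parallel process */
static char *done;              /* done[i] != 0 once task i has finished */
static long yields_per_task;
static int current;             /* index of the task being dispatched */

/* Each task yields back to the scheduler a fixed number of times. */
static void task_body(void) {
    int me = current;
    for (long k = 0; k < yields_per_task; k++)
        swapcontext(&task_ctx[me], &sched_ctx);
    done[me] = 1;               /* returning resumes sched_ctx via uc_link */
}

/* Returns the number of scheduler-to-task dispatches, or -1 on allocation
 * failure; stores the elapsed wall-clock time in *elapsed_sec. */
long run_round_robin(int ntasks, long nyields, double *elapsed_sec) {
    assert(ntasks > 0 && nyields >= 0 && elapsed_sec != NULL);
    task_ctx = calloc((size_t)ntasks, sizeof *task_ctx);
    done = calloc((size_t)ntasks, 1);
    char *stacks = malloc((size_t)ntasks * STACK_SIZE);
    if (!task_ctx || !done || !stacks) {
        free(task_ctx); free(done); free(stacks);
        return -1;
    }
    yields_per_task = nyields;

    for (int i = 0; i < ntasks; i++) {
        getcontext(&task_ctx[i]);
        task_ctx[i].uc_stack.ss_sp = stacks + (size_t)i * STACK_SIZE;
        task_ctx[i].uc_stack.ss_size = STACK_SIZE;
        task_ctx[i].uc_link = &sched_ctx;
        makecontext(&task_ctx[i], task_body, 0);
    }

    struct timespec t0, t1;
    long dispatches = 0;
    int remaining = ntasks;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (remaining > 0) {
        for (int i = 0; i < ntasks; i++) {
            if (done[i]) continue;
            current = i;
            swapcontext(&sched_ctx, &task_ctx[i]);  /* run task i until it yields */
            dispatches++;
            if (done[i]) remaining--;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *elapsed_sec = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

    free(stacks); free(done); free(task_ctx);
    return dispatches;
}
```

Each task is dispatched `nyields + 1` times (once per yield plus the final run to completion), so the return value is `ntasks * (nyields + 1)`. Calling `run_round_robin(1000, 1000, &t)` would approximate the largest configuration described above, although timings from this sketch are not comparable with the figures in the presentation.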
Summary and Future Work: Summary: the ULP enables low-overhead oversubscription by avoiding the overhead of the process context switch, and oversubscription using the ULP does not require any modification to the application. Future work: embed the capability of the ULP in MPI runtimes and evaluate it.