Impact of High-Bandwidth Memory on MPI Communications

This study evaluates the impact of high-bandwidth memory on MPI communications, motivated by the exacerbation of the memory wall problem at Exascale and the need to leverage new memory technologies. Topics covered include intranode communication in MPICH, the Intel Knights Landing memory architecture, and evaluations using the OSU microbenchmarks and a Stencil miniapp.



Presentation Transcript


1. Evaluating the Impact of High-Bandwidth Memory on MPI Communications. Giuseppe Congiu (gcongiu@anl.gov), Pavan Balaji (balaji@anl.gov), Mathematics and Computer Science Division, Argonne National Laboratory.

2. Motivation
- Exacerbation of the memory wall problem at Exascale: the increasing number of cores per node calls for faster memories.
- Emergence of new memory technologies and deep memory hierarchies:
  - 3D-stacked on-package High Bandwidth Memory (HBM): Intel Knights Landing Multi-Channel DRAM (MCDRAM), NVIDIA GPGPUs HBM2 (future HBM3/4), Fujitsu A64FX (Post-K) HBM2.
  - Non-Volatile Memories (NVMs): Intel/Micron 3D XPoint (NVDIMM).
- Need to leverage new memory technologies in MPI to improve shared-memory intranode communication performance.

3. Agenda
- Intranode Communication in MPICH: MPICH architecture and the Nemesis channel; point-to-point and Remote Memory Access.
- Intel Knights Landing Memory Architecture.
- Heterogeneous Memory in Linux Systems.
- Heterogeneous Shared Memory in MPICH.
- High-Bandwidth Memory Evaluation: OSU microbenchmarks; Stencil miniapp.
- Conclusions.

4. MPICH Architecture and the Nemesis Channel
- Modular design: the Abstract Device Interface decouples the MPI interface (MPI_Send(), MPI_Recv(), MPI_Put(), MPI_Get()) from the underlying transport implementation.
- CH3: legacy low-level transport implementation; it uses the Nemesis channel for intranode shared-memory communication.
- CH4: new lightweight transport implementation that exploits hardware offload; intranode shared-memory communication is provided as an additional module (Shmod).
- Specific network support is provided by netmods: OpenFabrics Interfaces (OFI) and Unified Communication X (UCX).
(Diagram: application, MPI interface, Abstract Device Interface, CH3 with Nemesis and Sock channels, CH4 with OFI and UCX netmods plus Shmod, hardware: network and memory.)

5. Point-to-point and Remote Memory Access (1/2)
- Intranode communication goes through shared memory; this works for both pt2pt and RMA (MPI_Win_create + MPI_Put/MPI_Get).
- Short messages use fastbox and cell memory objects (eager protocol).
- Long messages also use copy buffers: the receiver allocates a copy buffer and returns a handle to the sender, which transfers the data through it (rendezvous protocol). A minimal pt2pt example follows below.
(Diagram: two processes exchanging MPI_Isend/MPI_Recv through shared-memory cells and a copy buffer, with memcpy, enqueue, and dequeue steps over time t0..t3.)
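
For reference, a minimal sketch of the intranode pt2pt path described above: two ranks on the same node exchange a short and a long message. The message sizes (64 B and 1 MB) are only illustrative; the actual eager/rendezvous switch point is an MPICH implementation detail not stated on the slides.

```c
/* Minimal intranode pt2pt sketch: run with 2 ranks on one node,
 * e.g. "mpiexec -n 2 ./pt2pt_sketch". Message sizes are illustrative;
 * the eager/rendezvous cutoff is an MPICH implementation detail. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int small = 64;        /* likely served by fastboxes/cells */
    const int large = 1 << 20;   /* likely served by copy buffers    */
    char *buf = malloc(large);

    if (rank == 0) {
        MPI_Request req;
        MPI_Isend(buf, small, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Isend(buf, large, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buf, small, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(buf, large, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```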

6. Point-to-point and Remote Memory Access (2/2)
- RMA using MPI_Win_allocate (or MPI_Win_allocate_shared):
  - Collective call: all processes in the communicator invoke it.
  - Returns a shared-memory window handle and a pointer to the allocated memory.
  - The total size of the shared-memory window is size x number of processes; all processes collaborate to create it. A sketch of this allocation pattern follows below.
(Diagram: each process calls MPI_Win_allocate(size, ptr, &win) and then accesses remote portions of the shared-memory window with MPI_Put/MPI_Get.)
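
A minimal sketch of the collective window allocation and RMA access pattern described on this slide. The buffer size and the fence-based synchronization are illustrative choices, not prescribed by the slides.

```c
/* Minimal MPI_Win_allocate + MPI_Put/MPI_Get sketch (run with several
 * ranks on one node). Fence synchronization is used for brevity; the
 * slides do not prescribe a particular synchronization mode. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Collective: every rank contributes "size" bytes; the aggregate
     * window is size x nprocs, backed by shared memory intranode. */
    MPI_Aint size = 1024 * sizeof(int);
    int *base;
    MPI_Win win;
    MPI_Win_allocate(size, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD,
                     &base, &win);

    base[0] = rank;                     /* local part of the window */
    MPI_Win_fence(0, win);

    /* Each rank writes its id into element 1 of the next rank's window,
     * then reads back element 0 (the target's own rank id). */
    int target = (rank + 1) % nprocs;
    int value = rank, readback = -1;
    MPI_Put(&value, 1, MPI_INT, target, 1, 1, MPI_INT, win);
    MPI_Win_fence(0, win);
    MPI_Get(&readback, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("rank %d read %d from rank %d\n", rank, readback, target);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```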

7. Intel Knights Landing Memory Architecture
- MCDRAM memory modes:
  - Cache: direct-mapped cache in front of DDR4.
  - Flat: user addressable.
  - Hybrid: combination of cache and flat (25/75, 50/50, 75/25).
- Cluster modes: All-to-all, Quadrant, Hemisphere, SNC-2, SNC-4.
(Diagram: KNL package with PCIe3 x36, DDR4 (192 GB), on-package MCDRAM devices, and a tile of two cores, each with 2 VPUs and a 32 KB L1, sharing a 1 MB L2 and a CHA.)

8. Heterogeneous Memory in Linux
- Accessing MCDRAM in Flat mode: MCDRAM is detected as a separate NUMA node with no associated CPU cores.
- Virtual memory has to be allocated with mmap (page aligned and a multiple of the page size).
- The virtual memory then needs to be bound to physical memory in MCDRAM with mbind; a sketch of this mmap + mbind sequence follows below.
- mbind is Linux specific; more portable interfaces are available (e.g., hwloc_set_area_membind).
(Diagram: virtual address space mapped via mmap(size) onto physical memory in DRAM (NUMA 0) or MCDRAM (NUMA 1), with mbind(1) selecting the MCDRAM node.)
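
To make the mmap + mbind sequence concrete, here is a minimal sketch. It assumes MCDRAM is exposed as NUMA node 1, as in the slide's diagram; on a real system the node id should be discovered first (e.g., with hwloc or numactl -H). Build with -lnuma for the <numaif.h> wrapper.

```c
/* Allocate page-aligned memory with mmap and bind it to the MCDRAM
 * NUMA node with mbind. Assumes MCDRAM is NUMA node 1 (as in the
 * slide's diagram); discover the real node id on your system first.
 * Build: cc -o mcdram_alloc mcdram_alloc.c -lnuma */
#define _GNU_SOURCE
#include <numaif.h>      /* mbind, MPOL_BIND */
#include <sys/mman.h>    /* mmap, munmap */
#include <unistd.h>      /* sysconf */
#include <stdio.h>
#include <string.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 1024 * page;            /* must be a multiple of the page size */

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = 1UL << 1;   /* bit 1 => NUMA node 1 (MCDRAM, assumed) */
    if (mbind(buf, len, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");                 /* e.g. node 1 does not exist */
        return 1;
    }

    memset(buf, 0, len);                 /* touch pages so they are actually placed */
    printf("allocated %zu bytes bound to NUMA node 1\n", len);
    munmap(buf, len);
    return 0;
}
```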

9. Heterogeneous Shared Memory in MPICH
- Detect the different types of memory devices available on the node (DRAM, HBM/MCDRAM, NVRAM).
- Migrate MPICH intranode objects to the appropriate memory type:
  - pt2pt: fastbox, cell, and copy buffer objects.
  - RMA: shared-memory window objects, allocated through MPI_Win_allocate(MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, void *baseptr, MPI_Win *win) or MPI_Win_allocate_shared() with the same arguments.
- pt2pt: use CVARs (environment-set control variables) to select the memory binding.
- RMA: use the MPI_Info object (hard coded in the application) to select the memory binding, and use CVARs to override it from the environment. A hedged sketch of the MPI_Info-based binding follows below.
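
A minimal sketch of how an application might pass a memory-binding hint to MPI_Win_allocate through an MPI_Info object, as described above. The info key and value ("bind_memory" = "mcdram") are placeholders: the slides do not name the actual key MPICH understands, so consult the MPICH documentation for the real one. Unknown info keys are ignored by MPI, so this is safe to experiment with.

```c
/* Sketch of passing a memory-placement hint to MPI_Win_allocate via
 * MPI_Info. The key/value pair "bind_memory" = "mcdram" is a
 * PLACEHOLDER; the actual info key used by MPICH is not given on the
 * slides. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "bind_memory", "mcdram");   /* hypothetical key */

    MPI_Aint size = 1 << 20;    /* 1 MB window portion per process */
    char *base;
    MPI_Win win;
    MPI_Win_allocate(size, 1, info, MPI_COMM_WORLD, &base, &win);

    /* ... RMA communication on "win" as usual ... */

    MPI_Win_free(&win);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```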

10. Evaluation: Aggregated Memory Bandwidth
- Testbed: KNL node on the JLSE cluster; KNL-7210, 64 cores (Quadrant cluster mode), 192 GB DRAM, 16 GB MCDRAM (Flat mode).
- Microbenchmark: STREAM Copy. Each thread loads its elements (doubles) from array A and stores them into array B, with an increasing number of OpenMP threads (T); see the sketch below.
- Results (plot): no difference between DRAM and MCDRAM for T = 1, 2; MCDRAM delivers more than a 4x bandwidth improvement from T = 16 onward.
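
A minimal sketch of a STREAM-Copy-style kernel as described above; this is not the official STREAM benchmark, and the array size is an illustrative choice.

```c
/* STREAM-Copy-style kernel sketch (not the official STREAM code):
 * each OpenMP thread copies its share of A into B and the achieved
 * bandwidth is computed from the bytes moved.
 * Build: cc -O2 -fopenmp -o stream_copy stream_copy.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 26)   /* 64M doubles per array (512 MB each) */

int main(void)
{
    double *A = malloc(N * sizeof(double));
    double *B = malloc(N * sizeof(double));

    /* First-touch initialization: pages land on the NUMA node of the
     * touching thread (or wherever the process memory policy binds them). */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { A[i] = 1.0; B[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        B[i] = A[i];                     /* one load + one store per element */
    double t1 = omp_get_wtime();

    double gbytes = 2.0 * N * sizeof(double) / 1e9;
    printf("T = %d threads: %.2f GB/s\n",
           omp_get_max_threads(), gbytes / (t1 - t0));

    free(A); free(B);
    return 0;
}
```

To compare DRAM against MCDRAM placement in Flat mode, the same binary can be run with its memory bound to the corresponding NUMA node, for example under numactl --membind=1 for the MCDRAM node assumed above.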

11. Evaluation: MPI Communication Bandwidth (1/2)
- Testbed: KNL node on the JLSE cluster; KNL-7210, 64 cores (Quadrant), 192 GB DRAM, 16 GB MCDRAM (Flat).
- Microbenchmark: OSU multi-bw, RMA version. MPI_Put/MPI_Get using 32 process pairs; the origin buffer is kept in DRAM and the target buffer, i.e., the shared-memory window allocated with MPI_Win_allocate, is moved between DRAM (t) and MCDRAM (T). A sketch of the measurement pattern follows below.
(Plots: MPI_Put and MPI_Get bandwidth with the target window in DRAM (t) versus MCDRAM (T), also compared against MCDRAM in cache mode.)
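
The following is not the OSU multi-bw source, just a sketch of the measurement pattern it implies: pairs of ranks issue back-to-back MPI_Put operations into a window and time them. Window size, message size, and iteration count are illustrative.

```c
/* Sketch of an MPI_Put bandwidth measurement between process pairs,
 * in the spirit of OSU multi-bw (NOT the OSU source). Run with an
 * even number of ranks; rank i pairs with rank i + nprocs/2. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)   /* 1 MB messages */
#define ITERS    100

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *origin = malloc(MSG_SIZE);  /* origin buffer (kept in DRAM)      */
    char *base;                       /* target window; its placement can  */
    MPI_Win win;                      /* be controlled via MPI_Info/CVARs  */
    MPI_Win_allocate(MSG_SIZE, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    int half = nprocs / 2;
    int target = (rank < half) ? rank + half : rank - half;

    MPI_Win_lock_all(0, win);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    if (rank < half) {                /* only the first half drives Puts */
        for (int i = 0; i < ITERS; i++)
            MPI_Put(origin, MSG_SIZE, MPI_CHAR, target, 0,
                    MSG_SIZE, MPI_CHAR, win);
        MPI_Win_flush(target, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    MPI_Win_unlock_all(win);

    if (rank == 0)
        printf("aggregate bandwidth: %.2f GB/s\n",
               (double)half * ITERS * MSG_SIZE / (t1 - t0) / 1e9);

    MPI_Win_free(&win);
    free(origin);
    MPI_Finalize();
    return 0;
}
```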

12. Evaluation: MPI Communication Bandwidth (2/2)
- Testbed: KNL node on the JLSE cluster; KNL-7210, 64 cores (Quadrant), 192 GB DRAM, 16 GB MCDRAM (Flat).
- Microbenchmark: OSU multi-bw, pt2pt version. MPI_Isend/MPI_Recv using 32 process pairs; fastboxes, cells, and copy buffers are moved between memories (lowercase = DRAM, uppercase = MCDRAM; e.g., f = fastbox in DRAM, F = fastbox in MCDRAM).
(Plots: eager versus rendezvous bandwidth; in the eager range everything fits in cache.)

13. Evaluation: MPI Communication Latency (1/2)
- Testbed: KNL node on the JLSE cluster; KNL-7210, 64 cores (Quadrant), 192 GB DRAM, 16 GB MCDRAM (Flat).
- Microbenchmark: OSU multi-lat, RMA version. MPI_Put/MPI_Get using 32 process pairs; the origin buffer is kept in DRAM (o) and the target buffer is moved between DRAM (t) and MCDRAM (T) using MPI_Win_allocate.
- Latency (us) for the rows that can be recovered from the slide table (Red = reduction of o/T relative to o/t):

  MPI_Put:
    Size     o/t        o/T        Red [%]
    512 KB   545.213    252.521    53.683
    1 MB     1039.163   558.357    46.268
    2 MB     1940.753   999.427    48.503
    4 MB     3739.691   1857.87    50.320

  MPI_Get:
    Size     o/t        o/T        Red [%]
    256 KB   257.58     257.371    0.081
    512 KB   545.257    506.752    7.061
    1 MB     1037.612   931.533    10.223
    2 MB     1939.089   1717.608   11.421
    4 MB     3739.174   3294.915   11.881

  (Rows for smaller message sizes are not cleanly recoverable from the transcript.)

14. Evaluation: MPI Communication Latency (2/2)
- Testbed: KNL node on the JLSE cluster; KNL-7210, 64 cores (Quadrant), 192 GB DRAM, 16 GB MCDRAM (Flat).
- Microbenchmark: OSU multi-lat, pt2pt version. MPI_Isend/MPI_Recv using 32 process pairs; fastboxes (f/F), cells (c/C), and copy buffers (cb/CB) are moved between memories (lowercase = DRAM, uppercase = MCDRAM).
- Latency (us); Red = reduction relative to the all-DRAM baseline (f/c/cb) achieved by moving the relevant object to MCDRAM (fastboxes up to 32 KB, copy buffers from 64 KB on):

  Size     f/c/cb     F/c/cb     f/C/cb     F/C/cb     f/c/CB     Red [%]
  0        0.841      0.827      0.843      0.829      0.838      -
  1        0.891      0.883      0.892      0.886      0.892      -
  2        0.901      0.893      0.902      0.895      0.902      -
  4        0.938      0.928      0.942      0.93       0.94       -
  8        0.951      0.94       0.952      0.945      0.95       -
  16       0.981      0.971      0.983      0.974      0.98       -
  32       1.192      1.191      1.194      1.194      1.191      -
  64       1.192      1.191      1.197      1.192      1.191      -
  128      1.238      1.237      1.24       1.238      1.236      -
  256      1.252      1.252      1.257      1.255      1.252      -
  512      1.609      1.509      1.615      1.514      1.609      6.2
  1 KB     2.659      2.566      2.666      2.572      2.661      3.5
  2 KB     1.728      1.722      1.734      1.724      1.727      0.3
  4 KB     2.828      2.777      2.807      2.788      2.818      1.8
  8 KB     5.377      4.639      5.43       4.649      5.376      13.7
  16 KB    11.496     8.342      11.503     8.343      11.514     27.4
  32 KB    23.682     15.068     23.745     15.03      23.702     36.4
  64 KB    50.975     51.087     50.999     50.861     44.454     12.8
  128 KB   116.144    118.372    117.863    118.464    89.12      23.3
  256 KB   530.683    511.696    504.205    484.455    321.064    39.5
  512 KB   984.598    982.653    984.732    982.19     663.066    32.6
  1 MB     1920.218   1921.261   1916.377   1913.367   1313.256   31.6
  2 MB     3807.138   3805.567   3809.388   3806.401   2558.192   32.8
  4 MB     7567.653   7567.129   7577.482   7567.485   5090.969   32.7

15. Recommendations to Users
- For RMA, always place the target buffer in MCDRAM: performance is better when the store-side buffer is in MCDRAM (Put performs better than Get).
- For point-to-point, consider the worst-case communication scenario: each process communicates with every other process, i.e., N x (N - 1) connections, and each process posts enough MPI_Isend operations to use all the cells in its free queue.
- Point-to-point memory footprint for N = 64 processes:
  - Fastboxes: 64 KB x N x (N - 1) = 252 MB (~1.5% of MCDRAM)
  - Cells: 64 KB x 64 x N = 256 MB (~1.5% of MCDRAM)
  - Copy buffers: 32 KB x 8 x N x (N - 1) = 1008 MB (~6% of MCDRAM)
- MCDRAM (HBM) is limited: save as much HBM as possible for user buffers (prioritize them).
  - If the available budget is ~1.5%, place fastboxes in HBM.
  - If it is ~3%, place cells too.
  - If it is ~9%, place copy buffers too.

16. Real Use Case: 2D Stencil Code
- In most practical cases communication is confined and limited.
- Stencil code on 64 processes (8 x 8 process grid):
  - Inner processes: 6 x 6 x 4 = 144 communications
  - Side processes: 6 x 4 x 3 = 72 communications
  - Corner processes: 4 x 2 = 8 communications
  - Total: 224 out of the 4032 worst-case communications
- Actual memory footprint:
  - Fastboxes: 64 KB x 224 = 14 MB (~0.08% of MCDRAM)
  - Cells: 0 MB (each process sends one message at a time to every neighbor)
  - Copy buffers: 32 KB x 8 x 224 = 56 MB (~0.34% of MCDRAM)

17. Evaluation: Stencil Runtime (1/2)
- Testbed: KNL node on the JLSE cluster; KNL-7210, 64 cores (Quadrant), 192 GB DRAM, 16 GB MCDRAM (Flat).
- Miniapp benchmark: 2D stencil code.
  - Domain size (x, y) from 2048 x 2048 (32 MB) to 65536 x 65536 (32 GB) points (doubles).
  - Halos range from 2 KB to 64 KB along each direction.
  - Halos are exchanged using MPI_Isend/MPI_Irecv and MPI_Waitall; see the halo-exchange sketch below.
  - The total runtime is measured with different placements of the fastboxes and copy buffers.
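
A minimal sketch of the halo-exchange pattern described above (not the actual miniapp source): each rank posts MPI_Irecv/MPI_Isend toward its up/down/left/right neighbors and completes them with MPI_Waitall. The Cartesian decomposition and the halo width are illustrative.

```c
/* Halo-exchange sketch in the spirit of the 2D stencil miniapp (not
 * the actual miniapp source). Each rank exchanges one halo per
 * direction with its Cartesian neighbors using MPI_Isend/MPI_Irecv
 * and MPI_Waitall. Halo width and grid shape are illustrative. */
#include <mpi.h>

#define HALO 256   /* doubles per halo, i.e. a 2 KB halo */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Build a non-periodic 2D Cartesian communicator. */
    int dims[2] = {0, 0}, periods[2] = {0, 0}, nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    int up, down, left, right;
    MPI_Cart_shift(cart, 0, 1, &up, &down);
    MPI_Cart_shift(cart, 1, 1, &left, &right);
    int nbr[4] = {up, down, left, right};

    double sendbuf[4][HALO], recvbuf[4][HALO];
    for (int d = 0; d < 4; d++)
        for (int i = 0; i < HALO; i++) sendbuf[d][i] = 0.0;

    /* Post receives first, then sends; missing neighbors are
     * MPI_PROC_NULL and those operations complete immediately. */
    MPI_Request reqs[8];
    int nreq = 0;
    for (int d = 0; d < 4; d++)
        MPI_Irecv(recvbuf[d], HALO, MPI_DOUBLE, nbr[d], 0, cart, &reqs[nreq++]);
    for (int d = 0; d < 4; d++)
        MPI_Isend(sendbuf[d], HALO, MPI_DOUBLE, nbr[d], 0, cart, &reqs[nreq++]);
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```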

18. Evaluation: Stencil Runtime (2/2)
- Only fastboxes and copy buffers are considered: the stencil exchanges only one halo at a time with each neighbor.
- Short messages are always satisfied by fastboxes; long messages are always satisfied by fastboxes (for the header) and copy buffers (for the rest of the message).
- For 2 KB halos the stencil memory footprint is 32 MB x 2 (the old and the new matrix must both be stored).
- Up to 32 KB halos the stencil only uses fastboxes, so only they are optimized; over 32 KB halos the stencil already uses more MCDRAM than available, so only the copy buffers are optimized.
- Runtime with different placements (lowercase = DRAM, uppercase = MCDRAM; Red F / Red CB = reduction from moving fastboxes / copy buffers to MCDRAM):

  Halo    f/cb    F/cb    f/CB    Red F [%]   Red CB [%]
  2 KB    0.357   0.348   0.354   -           -
  4 KB    0.730   0.725   0.728   -           -
  8 KB    1.011   0.913   1.022   9.7         -
  16 KB   0.618   0.560   0.615   9.4         -
  32 KB   1.764   1.588   1.765   9.9         -
  64 KB   4.101   4.090   3.856   -           6.1

19. Conclusions
- General recommendations can be based on the worst-case scenario (all-to-all communication pattern).
- Memory usage, and thus the placement of objects, actually depends on the domain partitioning and the communication pattern.
- When the application memory footprint exceeds the MCDRAM capacity, library memory usage should be minimized and customized memory migration of user data should be performed.
- MPI should optimize short messages by placing fastboxes in HBM.
- If the application uses long messages, the memory requirements might be too high; avoid moving copy buffers in those cases.

20. Thank you! Please email any questions to the authors: gcongiu@anl.gov or balaji@anl.gov.
