E3SM Next-Gen Architecture & Strategy Insights
E3SM is transitioning to DOE's GPU-based systems while maintaining its computational mission of running effectively on DOE computers. Detailed benchmarks show where GPUs outperform CPUs, particularly at ultra-high resolution. The strategic roadmap outlines the shift toward GPU-centric simulations for greater efficiency and increased complexity in climate modeling.
Presentation Transcript
E3SM Next-Gen Architecture Strategy
Mark Taylor, mataylo@sandia.gov
E3SM Machine Roadmap
- CPUs: BER purchased dedicated systems (Anvil, CompyMcNodeFace): 6.2M node-hours per year. Great machines for moderate resolutions (fastest performance), but not large enough for high-res science campaigns.
- KNL architecture (Cori, Theta): most v1/v2 simulations will be run on KNL, through 2021. 2019 allocation: 10.5M node-hours (ERCAP, INCITE, ALCC).
- Our future is GPUs: 2019 allocation of 6M GPU hours. OLCF Summit (2019), NERSC Perlmutter (2020), ALCF Aurora (2021).
Strategy
- CPU systems are not going away; other modeling centers may continue to run on large CPU systems.
- DOE will be nearly 100% GPU systems by 2021.
- E3SM computational mission: an Earth system model that runs effectively on DOE computers. E3SM must run on GPU systems.
- E3SM phase 2 will focus on simulation regimes where GPU systems can outperform similar-size CPU systems.
Where can GPUs outperform CPUs?
- Good understanding of how GPU and KNL architectures perform, based on detailed benchmarks: 100% of the code running on KNL; 25% of the code running well on GPUs (atmosphere dycore); 50% partially complete (ocean/ice components).
- GPU systems require a large amount of work per node in order to outperform CPU systems (per watt).
- E3SM has excellent MPI performance, allowing us to obtain throughput by running in the strong-scaling limit (little work per node). In this regime, GPUs are not competitive with CPUs (per watt); see the toy model below.
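To make the strong-scaling point concrete, here is a minimal toy timing model (not an E3SM benchmark; the element count, per-element costs, and per-step overheads are all assumptions chosen for illustration). The GPU is modeled with a lower per-element cost but a higher fixed per-step overhead, so its advantage disappears once the work per node becomes small.

```cpp
// Toy strong-scaling model (illustrative only; all numbers are assumptions,
// not E3SM benchmark data). Per-node time per step = fixed overhead
// (kernel launch / MPI latency) + per-element work term.
#include <cstdio>

int main() {
  const double total_elements = 6.0 * 120 * 120;   // assumed global element count
  // Assumed per-element cost (s) and per-step overhead (s) per architecture.
  const double cpu_cost = 1.0e-4, cpu_overhead = 1.0e-3;
  const double gpu_cost = 2.0e-5, gpu_overhead = 1.0e-2;

  std::printf("%8s %14s %14s %14s\n", "nodes", "elems/node", "CPU s/step", "GPU s/step");
  for (int nodes : {16, 64, 256, 1024, 4096}) {
    double per_node = total_elements / nodes;
    double t_cpu = cpu_overhead + per_node * cpu_cost;   // CPU: low overhead, higher per-element cost
    double t_gpu = gpu_overhead + per_node * gpu_cost;   // GPU: high overhead, lower per-element cost
    std::printf("%8d %14.0f %14.4f %14.4f\n", nodes, per_node, t_cpu, t_gpu);
  }
  return 0;
}
```

With these assumed numbers the GPU wins at small node counts (lots of work per node) and loses in the strong-scaling limit, which is the regime E3SM uses to maximize throughput.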
Where can GPUs outperform CPUs? (continued)
- Ultra-high resolution (SCREAM project): at 3 km resolution and finer, there will be sufficient work per node to keep the GPU busy (see the element-count sketch below). At these resolutions, GPU systems could be up to 3x faster than CPU systems. Large allocations could afford O(10)-year simulations. This can start to address many large uncertainties in climate models, but doesn't replace climate models.
- Low and high resolutions (E3SM v2/v3): at these resolutions, maximum throughput is obtained in the strong-scaling regime, where GPUs will not provide any significant benefit over CPUs. If maximum throughput is not required, GPUs will allow more efficient simulations. Best estimate: if willing to run 2-5x slower, we can obtain more simulations per watt.
- Increased complexity: GPUs will allow us to increase complexity with proportionally less increase in cost (e.g., super-parameterization, more tracers).
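A back-of-the-envelope sketch of the work-per-node argument, assuming the cubed-sphere dycore grid has 6*ne*ne spectral elements, with ne=30 at roughly 100 km and ne=1024 at roughly 3 km (approximate grid sizes chosen for illustration; the node counts are likewise assumptions):

```cpp
// Work-per-node comparison sketch: at ~3 km the grid has millions of
// elements, so even thousands of nodes leave ample work per GPU; at
// ~100 km only a few thousand elements exist in total.
#include <cstdio>

int main() {
  struct Grid { const char* name; long ne; };
  const Grid grids[] = {{"~100 km (ne30)", 30}, {"~3 km (ne1024)", 1024}};
  const int node_counts[] = {100, 1000, 4000};  // assumed machine partitions

  for (const Grid& g : grids) {
    long elements = 6 * g.ne * g.ne;  // spectral elements on the cubed sphere
    std::printf("%s: %ld elements total\n", g.name, elements);
    for (int nodes : node_counts)
      std::printf("  %5d nodes -> %10.1f elements/node\n",
                  nodes, double(elements) / nodes);
  }
  return 0;
}
```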
Code Readiness 1
- E3SMv1: all configurations run well on CPU and KNL systems. Low-res model for climate science: O(5000) simulated years per year. High-res model for limited climate science: O(100) years. KNL provides a lot of cycles, but with similar performance to CPUs (per watt).
- E3SMv2: atmosphere dycore and MPAS components running on the GPU. Ocean/ice G cases running on GPUs (2019). Expect performance (per watt) similar to CPUs. Coupled B or atmosphere/land F cases: can we port the physics to GPU in time for v2? Can we get competitive performance?
Code Readiness 2
- E3SM-MMF (super-parameterization): atm/lnd F case running on GPUs (2019). Performance estimated 2x faster (per watt) compared to CPU.
- SCREAM: NH dycore running on GPUs in 2019 (idealized test cases). NH cloud-resolving, prescribed-aerosol physics: 2020 or 2021. 3 km global model, simulation length O(10) years. First global CRM climate sensitivity studies? GPUs will outperform CPUs.
Summary
- We have a lot of capability to use GPU systems: E3SM-MMF running well on GPU systems; MPAS G cases planned for 2019; SCREAM (UHR atmosphere) will run well on GPU systems ~2020.
- What about multi-decadal simulation campaigns on GPU systems? The problem is the atmosphere physics. Several options:
  (1) Port v1/v2 physics to GPUs; run slower but more efficiently.
  (2) Run these problems on CPU systems.
  (3) Use E3SM-MMF (super-parameterization).
  (4) Expand the SCREAM effort to support lower resolutions.
Code Readiness 3
- E3SMv3: full GPU support, with 2 possible atmosphere options. OpenACC port of MPAS components. Atmosphere (1): NH C++ dycore + OpenACC port of v2 physics. Atmosphere (2): E3SM-MMF.
- E3SMv4: atmosphere: merge SCREAM and E3SMv3 aerosols/convection physics. MPAS components: transition to a task-based parallel version?
GPU Porting Work
- Super-parameterization (SP) in the atmosphere using Fortran + OpenACC/OpenMP under ECP (first prototype by end of 2018).
- Ongoing porting of MPAS ocean/ice using Fortran/OpenACC under ECP (completion early 2019).
- SCREAM physics (C++/Kokkos): first prototype by end of 2019.
- SCREAM NH dynamics (C++/Kokkos): completion 2019.
- E3SM Performance group: feasibility study evaluating v1/v2 physics using Fortran/OpenACC (Spring 2019).
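For context, here is a minimal C++/Kokkos sketch of the performance-portability pattern the SCREAM ports rely on: a single source expresses the loop, and the execution space (host threads or GPU) is selected at build time. This is an illustrative kernel with made-up sizes, not actual SCREAM code.

```cpp
// Minimal Kokkos example: parallelize over atmospheric columns, with each
// column processing its vertical levels inside the kernel.
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int ncol = 10000, nlev = 72;            // assumed column/level counts
    Kokkos::View<double**> T("temperature", ncol, nlev);

    // One source line of parallelism; runs on CPU threads or GPU depending
    // on the Kokkos backend chosen at compile time.
    Kokkos::parallel_for("init_profile", ncol, KOKKOS_LAMBDA(const int i) {
      for (int k = 0; k < nlev; ++k)
        T(i, k) = 300.0 - 0.5 * k;                // placeholder physics
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```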
GPU Simulations
- E3SM-MMF super-parameterization (2019): outperform CPU systems.
- NH dynamics, 3 km (2019): outperform CPU systems.
- SCREAM 3 km model (2020): outperform CPU systems.
- MPAS G cases (2019): competitive with CPU systems?
- E3SM v2 low & high resolution (2021?): only if we can get the E3SM v2 physics ported (a large, expensive effort). GPUs won't speed up the model, but will enable more efficient simulations.
A21 Strategy
- Fortran/OpenACC code: early indications suggest the code restructuring needed to obtain good OpenACC GPU performance appears to be appropriate for A21. Push loops down the call stack to expose maximum parallelism. Keep loop bodies to relatively few lines of code, reducing GPU register pressure and making compiler processing easier for improved vectorization and A21 loop analysis (see the sketch below). Transition OpenACC code to OpenMP by 2021.
- C++/Kokkos code: current code obtains excellent performance on CPU, KNL, and GPU architectures, suggesting this abstraction is sufficiently powerful to also obtain good performance on A21.
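A sketch of the "push loops down the call stack" restructuring, written here in C++ with an OpenACC pragma purely as a language-neutral illustration (the actual restructuring targets the Fortran code; routine names and sizes are hypothetical):

```cpp
// Before: the caller loops over columns and calls a routine per column,
// so only a short vertical loop is visible to the GPU compiler.
// After: the column loop is pushed into the routine, exposing ncol*nlev
// independent iterations in one small loop body (less register pressure,
// easier vectorization and loop analysis).
#include <vector>

void update_column(double* q, int nlev) {          // "before" style
  for (int k = 0; k < nlev; ++k) q[k] *= 0.99;     // placeholder update
}

void update_all_columns(double* q, int ncol, int nlev) {  // "after" style
  #pragma acc parallel loop collapse(2) copy(q[0:ncol*nlev])
  for (int i = 0; i < ncol; ++i)
    for (int k = 0; k < nlev; ++k)
      q[i * nlev + k] *= 0.99;                     // same placeholder update
}

int main() {
  const int ncol = 1024, nlev = 72;                // assumed sizes
  std::vector<double> q(ncol * nlev, 1.0);
  update_column(q.data(), nlev);                   // per-column call pattern
  update_all_columns(q.data(), ncol, nlev);        // restructured, fully exposed
  return 0;
}
```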