Workshop Program
October 19 (Wednesday)
Dinner Location: Busboys and Poets
October 20 (Thursday)
List of Talks
(Keynote) Unstructured-Grid CFD Using Leadership-Class Computing on Emerging Architectures
- Eric Nielsen, NASA
About: Eric Nielsen is a Senior Research Scientist with the Computational AeroSciences Branch at NASA Langley Research Center in Hampton, Virginia. He received his PhD in Aerospace Engineering from Virginia Tech and has worked at Langley for the past 30 years. Dr. Nielsen specializes in the development of computational aerodynamics software for the world's most powerful computer systems. The software has been distributed to thousands of organizations around the country and supports major national research and engineering efforts at NASA, in industry, academia, the Department of Defense, and other government agencies. He has published extensively on the subject and has given presentations around the world on his work. He has served as the Principal Investigator on Agency-level projects at NASA as well as leadership-class computational efforts sponsored by the Department of Energy. Dr. Nielsen is a recipient of NASA's Silver Achievement, Exceptional Achievement, and Exceptional Engineering Achievement Medals as well as NASA Langley's HJE Reid Award for best research publication.
Abstract
An overview of challenges and recent efforts to leverage emerging HPC architectures for large-scale unstructured-grid CFD applications is presented. Experiences in applying a broad range of programming models are shown. Node-level performance is established for several recent hardware paradigms and scaling to the current generation of pre-exascale-class systems is demonstrated. Recent simulations with computational throughput equivalent to that of several million CPU cores are shown for human-scale Mars entry retropropulsion concepts, including real-gas effects of O2/CH4 combustion and the Martian CO2 atmosphere. Ongoing work aimed at the imminent era of exascale-class systems is also discussed.
(Keynote) Community Research Priorities for Next-Generation Scientific Computing
- Hal Finkel, U.S. Department of Energy
About: Hal is a program manager for computer-science research in the US Department of Energy Office of Science's Advanced Scientific Computing Research (ASCR) program. Prior to joining ASCR, Hal was the Lead for Compiler Technology and Programming Languages at Argonne’s Leadership Computing Facility. As part of DOE's Exascale Computing Project (ECP), Hal was a PathForward technical lead and PI/Co-PI of several multi-institution activities. Hal serves as vice chair of the C++ standards committee. He also helped develop the Hardware/Hybrid Accelerated Cosmology Code (HACC), a two-time IEEE/ACM Gordon Bell Prize finalist. Hal graduated from Yale University in 2011 with a Ph.D. in theoretical physics focusing on numerical simulation of early-universe cosmology.
Abstract
In this talk, I'll review community research priorities for hardware/software co-design, data management, and software development, among other areas, identified as part of activities sponsored by the U.S. Department of Energy’s Advanced Scientific Computing Research program. Future work on programming and runtime systems is expected to have an impact on, and be impacted by, progress in these important areas. I will highlight specific aspects of the identified research directions that directly relate to programming and runtime systems. Finally, I will discuss recent community input on software sustainability for scientific and high-performance computing and highlight some of the requirements for future programming systems that input might imply.
SpECTRE: A task-based spectral code for relativistic astrophysics
- Nils Deppe, California Institute of Technology
Abstract
SpECTRE (https://github.com/sxs-collaboration/spectre) is a next-generation multiphysics code implemented using Charm++ and is designed to overcome the limitations of current codes used in relativistic astrophysics. We will present an overview of SpECTRE's capabilities, including simulations of a binary black hole inspiral, producing binary black hole initial data using a discontinuous Galerkin solver, and simulations of neutron stars using a discontinuous Galerkin-finite difference hybrid method.
From Cosmology to Planets: the ChaNGa N-Body/hydrodynamics Code
- Thomas Quinn, University of Washington
Abstract
Gravitational dynamics dominates the physics of astrophysical objects ranging in size from the scale of the visible Universe down to individual comets and asteroids. Furthermore, the hydrodynamics of systems with enormous contrasts in physical densities needs to be modeled. The parallel tree-based gravity solver, ChaNGa, uses many of the features in Charm++ to efficiently simulate these astrophysical systems.
Paratreet
- Thomas Quinn, University of Washington
Abstract
Loimos: A Large-Scale Epidemic Simulation Framework for Realistic Social Contact Networks
- Joy Kitson, University of Maryland, College Park
- Ian Costello, University of Maryland, College Park
- Diego Jiménez, National High Technology Center
- Jiangzhuo Chen, University of Virginia
- Jaemin Choi, University of Illinois at Urbana-Champaign
- Stefan Hoops, University of Virginia
- Tamar Kellner, University of Maryland, College Park
- Esteban Meneses, National High Technology Center
- Henning Mortveit, University of Virginia
- Jae-Seung Yeom, Lawrence Livermore National Laboratory
- Laxmikant V. Kale, University of Illinois at Urbana-Champaign
- Madhav V., University of Virginia
- Abhinav Bhatele, University of Maryland, College Park
Abstract
Global pandemics can wreak havoc and lead to significant social, economic and personal losses. Preventing the spread of infectious diseases requires interventions at different levels, which in turn requires studying the potential impact and efficacy of those preemptive measures. Modeling epidemic diffusion and possible interventions can help us toward this goal. Agent-based models have been used effectively in the past to model contagion processes. We present Loimos, a highly parallel simulation of epidemic diffusion written on top of the Charm++ asynchronous task-based system. Loimos uses a hybrid time-stepped and discrete-event simulation to model disease spread. We demonstrate that our implementation of Loimos is able to efficiently utilize a large number of cores on different HPC platforms: we scale to about 32k cores on Theta at ALCF and about 4k cores on Cori at NERSC.
Vector Load Balancing in Charm++
- Ronak Buch, University of Illinois at Urbana-Champaign
Abstract
Load balancing has proven to be critical in achieving scalable, high-performance execution of dynamic scientific applications in HPC. While existing schemes of measuring and performing load balancing are effective for many applications, the increasingly complex structure, heterogeneity, and memory and network constraints of modern applications are often not fully captured or analyzed in the load balancing context, leading to suboptimal balancing decisions. To address this, we have added support for vector load balancing to Charm++, wherein load is a vector of multiple values. We present the additions to the runtime system that underpin this support, new load balancing strategies that can exploit this higher granularity data, and the performance improvements it provides to benchmarks and applications.
Support for Charm++ on cloud using Kubernetes
- Aditya Bhosale, University of Illinois at Urbana-Champaign
- Kavitha Chandrasekar, University of Illinois at Urbana-Champaign
- Volodymyr Kindratenko, University of Illinois at Urbana-Champaign
- Sanjay Kale, University of Illinois at Urbana-Champaign
- Carlos Costa, IBM
- Sara Schumacher, IBM
- Claudia Misale, IBM
- Pedro Bello-Maldonado, IBM
Abstract
In this talk, we will present our work on supporting HPC applications, specifically supporting Charm++ and its applications on Kubernetes-based clouds. We will talk about the operator that allows a user to launch and manage a Charm++ application run on Kubernetes. We will also talk about our recent efforts on supporting autoscaling with Kubernetes using the shrink-expand feature in Charm++ along with adding Charm++ operator support in Kubernetes to allow dynamically adding or removing nodes/pods for a job.
Combining Worksharing and Overdecomposition for Improved Communication-Computation Overlap
- Michael Robson, Villanova University
- Laxmikant V. Kale, University of Illinois at Urbana-Champaign
Abstract
In this work, we demonstrate the effectiveness of combining the techniques of overdecomposition and worksharing to improve application performance. Our key insight is that tuning these two parameters in combination optimizes performance by facilitating the tradeoff of communication overlap and overhead. We explore this new space of potential optimization by varying both the problem size decomposition (grain size) and number of cores assigned to execute a particular task (worksharing). Utilizing these two variables in concert, we can shape the execution timeline of applications in order to more smoothly inject messages on the network, improve cache performance, and decrease the overall execution time of an application. As single-node performance continues to outpace network bandwidth, ensuring smooth and continuous injection of messages into the network will continue to be of crucial importance. Our preliminary results demonstrate a greater than two-fold improvement in performance over a naive OpenMP-only baseline and a thirty percent speedup over the previously best performing implementation of the same code. This work examines the interaction of these two parameters and their potential to increase application performance.
Applications of Template Metaprogramming based registration to Charm++
- Aditya Bhosale, University of Illinois at Urbana-Champaign
- Justin Szaday, University of Illinois at Urbana-Champaign
- Nikunj Gupta, University of Illinois at Urbana-Champaign
Abstract
C++17 provides certain key improvements to template metaprogramming (TMP). We leverage these features in exploring a Charm++ implementation that does not rely on charmxi for registering chares and their entry methods. In this presentation, we will talk about Charmlite, a 'pure' C++17 solution for Charm++ programming leveraging TMP. We will extend the talk to CK-TMP, a 'production' implementation of Charmlite supporting pointer-to-offset optimizations via ck::sapn and macro-based entry method attributes. We have successfully ported LeanMD to this exciting new library!
Leveraging Heterogeneous Architectures with NAMD
- David Hardy, University of Illinois at Urbana-Champaign
Abstract
NAMD is a parallel molecular dynamics (MD) application capable of simulating very large systems of biomolecules and made highly scalable through Charm++. One of the largest NAMD simulations has been the SARS-CoV-2 coronavirus within a respiratory aerosol droplet, a system totaling over one billion atoms and simulated on 4096 nodes of the ORNL Summit supercomputer. However, the vast majority of MD practitioners simulate much smaller systems, in the range of 50,000 to 1,000,000 atoms. NAMD has over the past five years developed a new GPU-resident code path in order to better leverage heterogeneous architectures for these smaller systems. This talk presents the recent developments that have improved NAMD performance and scaling on DGX-like architectures. Upcoming heterogeneous architectures will also be discussed together with how NAMD and Charm++ could be extended to support them.
Scalable GW software for excited electrons using OpenAtom
- Kayahan Saritas, Yale University
- Kavitha Chandrasekar, University of Illinois at Urbana-Champaign
- Sohrab Ismail-Beigi, Yale University
- Laxmikant V. Kale, University of Illinois at Urbana-Champaign
Abstract
OpenAtom is an open-source, massively parallel software application that performs ab-initio molecular dynamics simulations and ground- and excited-state calculations utilizing a planewave basis set, and relies on the Charm++ runtime system. We describe the status of the excited-state GW implementation in OpenAtom: the GW method is an accurate but computationally expensive method for describing dynamics of excited electrons in solid state systems. We will present our progress in implementing an O(N^3) scaling GW method in OpenAtom. In addition to the formalism and physical principles, our data and task parallelization methods and scaling results will be presented.
ExaM2M: Scalable and Adaptive Mesh-to-Mesh Transfer
- Eric Bohm, Charmworks, Inc.
- Eric Mikida, Charmworks, Inc.
- Nitin Bhat, Charmworks, Inc.
Abstract
The ExaM2M project was inspired by the desire to model fluid-structure interaction (FSI) for large problems, requiring large meshes distributed across many compute nodes in the context of Quinoa's Inciter code for fluid dynamics. The general problem of managing solution transfer has many applications and presents a variety of challenges. This talk will cover the basic design of ExaM2M, changes made to improve its performance and utility, and recent results.
Improving Communication Asynchrony and Concurrency in Adaptive MPI
- Sam White, Charmworks, Inc.
Abstract
Thread-based MPI runtimes, which associate private communication contexts or endpoints with each thread, rather than sharing a single context across a multithreaded process, have been proposed as an alternative to MPI’s traditional multithreading models. Adaptive MPI is one such implementation, which also supports overdecomposition and dynamic load balancing on top of Charm++'s runtime. In this talk, we identify and overcome shortcomings in AMPI's support for asynchronous point-to-point communication. We examine the consequences of MPI’s messaging semantics on its runtime and investigate how its design can be improved for applications that do not require the full messaging semantics. We show that the issues for AMPI reduce to similar problems first identified in the context of efficient MPI+X support. Our focus is on enhancing AMPI’s support for asynchrony and concurrency while still optimizing for communication locality through a novel locality-aware message matching scheme. We compare performance with and without the relaxed messaging semantics and our associated optimizations.
Efficiency Improvements in ADCIRC Coastal Flood Modeling with AMPI Dynamic Load Balancing
- Dylan Wood, University of Notre Dame
- Joannes Westerink, University of Notre Dame
- Damrongsak Wirasaet, University of Notre Dame
- Laxmikant V. Kale, University of Illinois at Urbana-Champaign
- Eric Bohm, Charmworks, Inc.
Abstract
The ADCIRC model -- based on the shallow-water equations and routinely used in academic, private and official contexts -- currently exhibits considerable computational load imbalance when applied in large-scale parallel simulations to study coastal flooding risks due to hurricane storm surge. For recently developed, advanced finite element unstructured meshes -- which typically cover large geographic regions, spanning up to the entire globe -- the imbalance can exceed 50%, as increasing grid resolution is applied in floodplain regions of the model, e.g., areas that are above sea level, since 1) dry areas exert no computational load, 2) hurricane storm surge typically only inundates a small portion of the model and 3) the software currently implements standard MPI and therefore has no mechanism for decoupling each rank (decomposed portion of the mesh) from each CPU. Despite the increased reliability in flood modeling attainable, these meshes are currently restrictive due to their unnecessarily high computational demand, making dynamic load balancing desirable -- especially in time-critical forecasting contexts. We recently developed an AMPI implementation within ADCIRC to address such concerns by using over-decomposition and the virtualization capabilities of AMPI to facilitate migration of ranks between CPUs as the ratio of dry to wet areas in the model evolves over time during the course of a flood simulation. AMPI also facilitates future goals for implementing ADCIRC within a larger framework for simulating compound flooding (e.g., flooding from storm surge and tide, from rain, and from overland runoff) by interfacing several different software packages exhibiting highly diverse computational demands.
Quantifying Overheads in Charm++ and HPX using Task Bench
- Ioannis Gonidelis, Center of Computation and Technology, LSU
- Patrick Diehl, Center of Computation and Technology, LSU
Abstract
Asynchronous Many-Task (AMT) runtime systems take advantage of multi-core architectures with light-weight threads, asynchronous executions, and smart scheduling. In this talk, we present the comparison of the AMT systems Charm++ and HPX with the mainstream MPI, OpenMP, and MPI+OpenMP libraries using the Task Bench benchmarks. Charm++ is a parallel programming language based on C++, supporting stackless tasks as well as light-weight threads asynchronously along with an adaptive runtime system. HPX is a C++ library for concurrency and parallelism, exposing a C++ standard-conforming API. First, we analyze the commonalities, differences, and advantageous scenarios of Charm++ and HPX in detail. Further, to investigate the potential overheads introduced by the tasking systems of Charm++ and HPX, we utilize an existing parameterized benchmark, Task Bench, wherein 15 different programming systems were implemented, e.g., MPI, OpenMP, MPI + OpenMP, and extend Task Bench by adding HPX implementations. We quantify the overheads of Charm++, HPX, and the mainstream libraries in different scenarios where a single task and multiple tasks are assigned to each core, respectively. We also investigate each system's scalability and the ability to hide the communication latency.
A Performance Evaluation of Adaptive MPI for a Particle-In-Cell Code
- Christian Asch, Costa Rica High Technology Center - Advanced Computing National Laboratory
- Diego Jimenez, Costa Rica High Technology Center - Advanced Computing National Laboratory
- Markus Rampp, Max Planck Computing and Data Facility (MPCDF)
- Erwin Laure, Max Planck Computing and Data Facility (MPCDF)
- Esteban Meneses, Costa Rica High Technology Center - Advanced Computing National Laboratory
Abstract
In the quest for extreme-scale supercomputers, the High Performance Computing (HPC) community has developed many resources (programming paradigms, architectures, methodologies, numerical methods) to face the multiple challenges along the way. One class of those resources is task-based parallel programming tools. The availability of mature programming models, programming languages, and runtime systems that use task-based parallelism represents a favorable ecosystem. The fundamental premise of these tools is their ability to naturally cope with dynamically changing execution conditions, i.e., adaptivity. In this paper, we explore Adaptive MPI, a parallel-object framework, as a mechanism to provide, among other features, automatic and dynamic load balancing for a particle-in-cell application. We ported a pre-existing MPI application to the Adaptive MPI infrastructure and highlight the changes required to the code. Our experimental results show Adaptive MPI has minimal overhead, maintains the scalability of the original code, and is able to alleviate an artificially introduced load imbalance.
Extreme-scale GPU Computational Fluid Dynamics with AMR
- Joshua Davis, University of Maryland, College Park
- Justin Shafner, University of Maryland, College Park
- Daniel Nichols, University of Maryland, College Park
- Nathan Grube, University of Maryland, College Park
- Pino Martin, University of Maryland, College Park
- Abhinav Bhatele, University of Maryland, College Park
Abstract
Accurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, and applies to atmospheric flight, supersonic combustion, materials discovery and climate prediction. In this talk, we describe our experiences in extending the capabilities of and modernizing CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using a highly-scalable AMR library, AMReX, and add support for a fully curvilinear solver. We also share our experiences in porting the computational kernels in CRoCCo to NVIDIA GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo-AMR, on the Summit system, demonstrating a 5× to 24× speedup over the CPU-only version.
Runtime adaptivity and load balancing in the AMReX framework
- Andrew Myers, Lawrence Berkeley National Laboratory
Abstract
AMReX is an exascale-ready software framework for building massively parallel block-structured adaptive mesh refinement applications. It forms the basis for the spatial and temporal discretizations of a wide variety of scientific codes, spanning subject areas such as astrophysics and cosmology, combustion research, wind farm modeling, particle accelerators, and more. AMReX uses an "MPI+X" programming model, where "X" is one of CUDA, HIP, or DPC++ for GPU execution, or OpenMP for many-core CPUs. In this talk, I will give an overview of AMReX and its associated ecosystem of application codes, with a particular focus on the tools provided to support runtime adaptivity and dynamic load balancing for multiphysics problems with heterogeneous workloads.
CharmTyles: Large-scale Interactive Charm++ with Python
- Aditya Bhosale, University of Illinois at Urbana-Champaign
- Nikunj Gupta, University of Illinois at Urbana-Champaign
- Zane Fink, University of Illinois at Urbana-Champaign
- Aryan Sharma, Indian Institute of Technology, Kanpur
Abstract
Python has become a popular choice of language for various applications ranging from scientific computing to machine learning in recent years. Libraries such as NumPy and SciPy have made it possible to achieve high performance while maintaining productivity on a single node. Frameworks such as Charm4Py and mpi4py have helped achieve high performance on distributed systems but lack the interactivity that tools like Jupyter notebooks provide. CharmTyles aims at providing a set of abstractions working on a client-server model with a Python frontend and a Charm++ server on the backend to maintain interactivity while still achieving good performance. In this talk, we present a dense linear algebra abstraction and a stencil computation abstraction based on this programming model.
A Productive and Scalable Actor-Based Programming System for PGAS Applications
- Sri Raj Paul, Intel Corporation
- Akihiro Hayashi, Georgia Institute of Technology
- Kun Chen, Georgia Institute of Technology
- Youssef W Elmougy, Georgia Institute of Technology
- Vivek Sarkar, Georgia Institute of Technology
Abstract
The Partitioned Global Address Space (PGAS) model is well suited to irregular applications due to its support for short, one-sided messages. However, there are currently two major limitations faced by PGAS applications. The first relates to scalability: despite the availability of APIs that support non-blocking operations in special cases, many PGAS operations on remote locations are synchronous by default, which can lead to long-latency stalls and poor scalability. The second relates to productivity: while it is simpler for the developer to express all communications at a fine granularity that is natural to the application, experiments have shown that such a natural expression results in performance that is 20 times slower than less productive code that requires manual message aggregation and termination detection. The actor model has been gaining popularity as a productive asynchronous message-passing approach for distributed objects in enterprise and cloud computing platforms, typically implemented in languages such as Erlang, Scala or Rust. To the best of our knowledge, there has been no past work on using the actor model to deliver both productivity and scalability to PGAS applications. In this work, we introduce a new programming system for PGAS applications, in which all remote operations are expressed as fine-grained asynchronous actor messages. In this approach, the programmer does not need to worry about programming complexities related to message aggregation and termination detection. Thus, our approach offers a desirable point in the productivity-performance space for PGAS applications, with both scalable performance and higher productivity relative to past approaches.
The OpenMP Cluster Programming Model
- Hervé Yviquel, UNICAMP
- Guilherme Valarini, UNICAMP
- Marcio Pereira, UNICAMP
- Guido Araujo, UNICAMP
Abstract
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve that, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.
Enzo-E / Cello: Recent advances and future directions
- James Bordner, University of California, San Diego
- Michael Norman, University of California, San Diego
Abstract
Enzo-E is a numerical astrophysics and cosmology application built on the Cello scalable adaptive mesh refinement framework. We will provide a review of the application and framework; discuss recent and ongoing work in improving I/O, adding ML support, and implementing a scalable gravity solver; and outline future efforts in implementing adaptive time-stepping.
LBAF-Viz: A New Application and Library to Visualize Computational Load and Communication Graphs
- Nicolas Morales, Sandia National Labs
- Philippe Pébaÿ, NexGen Analytics
- Marcin Wróbel, NexGen Analytics
- Nicole Slattengren, Sandia National Labs
- Jonathan Lifflander, Sandia National Labs
Abstract
Reducing or even eliminating load imbalance is not the only important aspect to improve the scalability and overall execution speed of distributed applications. Indeed, inter-task and in particular off-node inter-task communication can also be a severe bottleneck slowing down applications. In order to better comprehend the respective and combined contributions of computational load and communication weight to overall performance, we have developed an analysis tool that can produce both still renderings and mesh-based data apt for interactive visualization, e.g. using ParaView. Furthermore, in addition to visualization capabilities, this tool also provides a general framework to simulate the behavior of load and communication balancing algorithms that is both easy to use and extend thanks to its modular design implemented in an open-source Python library. The goal of this presentation is therefore two-fold: first, to explain the general methodology utilized by this visualization and analysis framework, and how it relates to existing load-balancing algorithms; second, to offer a demonstration of the visualization tool using load and communication data output by actual simulation runs in a task-distributed setting using DARMA/vt. The latter, in particular, is designed to be interactive with the audience as well as to foster discussion regarding the benefits of this tool for performance optimization of distributed applications.
Scalable Heterogeneous Computing with Asynchronous Message-Driven Execution
- Jaemin Choi, NVIDIA Corporation
Abstract
Computer systems today are becoming increasingly heterogeneous, in response to rapidly rising performance requirements of both traditional and emerging workloads including computational science, data science, and machine learning, pushing the limits of power and energy imposed by the silicon. Asynchronous message-driven execution, as realized in the Charm++ parallel programming system, is an emerging model that has been proven to be effective in traditional CPU-based systems and large-scale parallel execution due to its adaptive features such as automatic computation-communication overlap and dynamic load balancing. However, when applied to modern heterogeneous and GPU-accelerated systems, asynchronous message-driven execution presents many challenges in realizing overdecomposition and asynchronous progress with low overhead and minimal synchronization between the host and device as well as between the work units executing in parallel. We analyze the issues in realizing efficient asynchronous message-driven execution on modern heterogeneous systems and introduce new capabilities and approaches to address them in the form of runtime support in the Charm++ parallel programming system.
Dynamic load balancing strategies in contact applications
- Nicolas Morales, Sandia National Labs
- Ulrich Hetmaniuk, NexGen Analytics
- Jonathan Lifflander, Sandia National Labs
- Reese Jones, Sandia National Labs