Workshop Program

October 19 (Wednesday)

Time (EDT) Type Title Speaker
Session: Keynote 1
09:00 - 09:30 Opening Remarks Workshop Introduction and Overview of the Charm++ Ecosystem Sanjay Kale
09:30 - 10:30 Keynote Unstructured-Grid CFD Using Leadership-Class Computing on Emerging Architectures Eric Nielsen
10:30 - 11:00 Break
Session: Astrophysical Simulations
11:00 - 11:30 Talk SpECTRE: A task-based spectral code for relativistic astrophysics Nils Deppe
11:30 - 12:00 Talk From Cosmology to Planets: the ChaNGa N-Body/hydrodynamics Code Thomas Quinn
12:00 - 12:15 Talk Paratreet Thomas Quinn
12:15 - 12:30 Talk Loimos: A Large-Scale Epidemic Simulation Framework for Realistic Social Contact Networks Joy Kitson
12:30 - 13:30 Break
Session: Charm++
13:30 - 14:00 Talk Vector Load Balancing in Charm++ Ronak Buch
14:00 - 14:15 Talk Support for Charm++ on cloud using Kubernetes Aditya Bhosale
14:15 - 14:45 Talk Combining Worksharing and Overdecomposition for Improved Communication-Computation Overlap Michael Robson
14:45 - 15:00 Talk Applications of Template Metaprogramming based registration to Charm++ Justin Szaday
15:00 - 15:30 Break
Session: Applications 1
15:30 - 16:00 Talk Leveraging Heterogeneous Architectures with NAMD David Hardy
16:00 - 16:30 Talk Scalable GW software for excited electrons using OpenAtom Kayahan Saritas
16:30 - 17:00 Talk ExaM2M: Scalable and Adaptive Mesh-to-Mesh Transfer Eric Bohm
17:00 - 17:30 Talk Improving Communication Asynchrony and Concurrency in Adaptive MPI Sam White

Dinner Location: Busboys and Poets

October 20 (Thursday)

Time (EDT) Type Title Speaker
Session: Keynote 2
09:00 - 09:55 Keynote Community Research Priorities for Next-Generation Scientific Computing Hal Finkel
09:55 - 10:20 Talk Efficiency Improvements in ADCIRC Coastal Flood Modeling with AMPI Dynamic Load Balancing Dylan Wood
10:20 - 10:35 Talk Quantifying Overheads in Charm++ and HPX using Task Bench Patrick Diehl
10:35 - 11:00 Break
Session: Applications 2
11:00 - 11:30 Talk A Performance Evaluation of Adaptive MPI for a Particle-In-Cell Code Christian Asch
11:30 - 12:00 Talk Extreme-scale GPU Computational Fluid Dynamics with AMR Joshua Davis
12:00 - 12:30 Talk Runtime adaptivity and load balancing in the AMReX framework Andrew Myers
12:30 - 13:30 Break
Session: Programming Models
13:30 - 14:00 Talk CharmTyles: Large-scale Interactive Charm++ with Python Aditya Bhosale
14:00 - 14:30 Talk A Productive and Scalable Actor-Based Programming System for PGAS Applications Sri Raj Paul
14:30 - 15:00 Talk The OpenMP Cluster Programming Model Guilherme Valarini
15:00 - 15:30 Break
Session: Models and Applications
15:30 - 16:00 Talk Enzo-E / Cello: Recent advances and future directions James Bordner
16:00 - 16:30 Talk LBAF-Viz: A New Application and Library to Visualize Computational Load and Communication Graphs Nicolas Morales
16:30 - 17:00 Talk Scalable Heterogeneous Computing with Asynchronous Message-Driven Execution Jaemin Choi
17:00 - 17:30 Talk Dynamic load balancing strategies in contact applications Nicolas Morales
17:30 - 17:40 Closing Remarks


List of Talks


(Keynote) Unstructured-Grid CFD Using Leadership-Class Computing on Emerging Architectures

Eric Nielsen, NASA

About: Eric Nielsen is a Senior Research Scientist with the Computational AeroSciences Branch at NASA Langley Research Center in Hampton, Virginia. He received his PhD in Aerospace Engineering from Virginia Tech and has worked at Langley for the past 30 years. Dr. Nielsen specializes in the development of computational aerodynamics software for the world's most powerful computer systems. The software has been distributed to thousands of organizations around the country and supports major national research and engineering efforts at NASA, in industry, academia, the Department of Defense, and other government agencies. He has published extensively on the subject and has given presentations around the world on his work. He has served as the Principal Investigator on Agency-level projects at NASA as well as leadership-class computational efforts sponsored by the Department of Energy. Dr. Nielsen is a recipient of NASA's Silver Achievement, Exceptional Achievement, and Exceptional Engineering Achievement Medals as well as NASA Langley's HJE Reid Award for best research publication.

Abstract An overview of challenges and recent efforts to leverage emerging HPC architectures for large-scale unstructured-grid CFD applications is presented. Experiences in applying a broad range of programming models are shown. Node-level performance is established for several recent hardware paradigms and scaling to the current generation of pre-exascale-class systems is demonstrated. Recent simulations with computational throughput equivalent to that of several million CPU cores are shown for human-scale Mars entry retropropulsion concepts, including real-gas effects of O2/CH4 combustion and the Martian CO2 atmosphere. Ongoing work aimed at the imminent era of exascale-class systems is also discussed.

(Keynote) Community Research Priorities for Next-Generation Scientific Computing

  • Hal Finkel, U.S. Department of Energy

About: Hal is a program manager for computer-science research in the US Department of Energy Office of Science's Advanced Scientific Computing Research (ASCR) program. Prior to joining ASCR, Hal was the Lead for Compiler Technology and Programming Languages at Argonne’s Leadership Computing Facility. As part of DOE's Exascale Computing Project (ECP), Hal was a PathForward technical lead and PI/Co-PI of several multi-institution activities. Hal serves as vice chair of the C++ standards committee. He also helped develop the Hardware/Hybrid Accelerated Cosmology Code (HACC), a two-time IEEE/ACM Gordon Bell Prize finalist. Hal graduated from Yale University in 2011 with a Ph.D. in theoretical physics focusing on numerical simulation of early-universe cosmology.

Abstract In this talk, I'll review community research priorities for hardware/software co-design, data management, and software development, among other areas, identified as part of activities sponsored by the U.S. Department of Energy’s Advanced Scientific Computing Research program. Future work on programming and runtime systems is expected to have an impact on, and be impacted by, progress in these important areas. I will highlight specific aspects of the identified research directions that directly relate to programming and runtime systems. Finally, I will discuss recent community input on software sustainability for scientific and high-performance computing and highlight some of the requirements for future programming systems that input might imply.

SpECTRE: A task-based spectral code for relativistic astrophysics

Nils Deppe, California Institute of Technology

Abstract SpECTRE (https://github.com/sxs-collaboration/spectre) is a next-generation multiphysics code implemented using Charm++ and is designed to overcome the limitations of current codes used in relativistic astrophysics. We will present an overview of SpECTRE's capabilities, including simulations of a binary black hole inspiral, producing binary black hole initial data using a discontinuous Galerkin solver, and simulations of neutron stars using a discontinuous Galerkin-finite difference hybrid method.

From Cosmology to Planets: the ChaNGa N-Body/hydrodynamics Code

Thomas Quinn, University of Washington

Abstract Gravitational dynamics dominates the physics of astrophysical objects ranging in size from the scale of the visible Universe down to individual comets and asteroids. Furthermore, hydrodynamic systems with enormous contrasts in physical density need to be modeled. The parallel tree-based gravity solver, ChaNGa, uses many of the features in Charm++ to efficiently simulate these astrophysical systems.
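
For readers unfamiliar with tree codes, the sketch below illustrates the classic Barnes-Hut opening-angle test at the heart of gravity solvers in this tradition: a sufficiently distant tree node is approximated by its aggregate mass, while a nearby node is opened and its children visited. This is a minimal, plain-C++ illustration, not ChaNGa's actual data structures or traversal; the node layout, theta parameter, and softening are assumptions chosen for clarity.

    #include <array>
    #include <cmath>
    #include <vector>

    // Illustrative Barnes-Hut node: not ChaNGa's data structures.
    struct Node {
        std::array<double, 3> com{};   // center of mass
        double mass = 0.0;
        double size = 0.0;             // side length of the node's bounding box
        std::vector<Node> children;    // empty for leaves
    };

    // Accumulate the acceleration on a particle at position p using the
    // standard opening-angle criterion: size / distance < theta.
    void accumulateGravity(const Node& n, const std::array<double, 3>& p,
                           double theta, double eps2, std::array<double, 3>& acc) {
        double dx = n.com[0] - p[0], dy = n.com[1] - p[1], dz = n.com[2] - p[2];
        double r2 = dx * dx + dy * dy + dz * dz + eps2;  // softened distance^2
        double r = std::sqrt(r2);
        if (n.children.empty() || n.size / r < theta) {
            // Node is far enough away (or a leaf): use its monopole moment.
            double f = n.mass / (r2 * r);
            acc[0] += f * dx; acc[1] += f * dy; acc[2] += f * dz;
        } else {
            // Node is too close: open it and recurse into the children.
            for (const Node& c : n.children)
                accumulateGravity(c, p, theta, eps2, acc);
        }
    }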

Paratreet

  • Thomas Quinn, University of Washington

Abstract

Loimos: A Large-Scale Epidemic Simulation Framework for Realistic Social Contact Networks

  • Joy Kitson, University of Maryland, College Park
  • Ian Costello, University of Maryland, College Park
  • Diego Jiménez, National High Technology Center
  • Jiangzhuo Chen, University of Virginia
  • Jaemin Choi, University of Illinois at Urbana-Champaign
  • Stefan Hoops, University of Virginia
  • Tamar Kellner, University of Maryland, College Park
  • Esteban Meneses, National High Technology Center
  • Henning Mortveit, University of Virginia
  • Jae-Seung Yeom, Lawrence Livermore National Laboratory
  • Laxmikant V. Kale, University of Illinois at Urbana-Champaign
  • Madhav V., University of Virginia
  • Abhinav Bhatele, University of Maryland, College Park

Abstract Global pandemics can wreak havoc and lead to significant social, economic and personal losses. Preventing the spread of infectious diseases requires interventions at different levels, which in turn requires studying the potential impact and efficacy of those preemptive measures. Modeling epidemic diffusion and possible interventions can help us in this goal. Agent-based models have been used effectively in the past to model contagion processes. We present Loimos, a highly parallel simulation of epidemic diffusion written on top of the Charm++ asynchronous task-based system. Loimos uses a hybrid time-stepped and discrete-event simulation to model disease spread. We demonstrate that our implementation of Loimos is able to efficiently utilize a large number of cores on different HPC platforms; namely, we scale to about 32k cores on Theta at ALCF and about 4k cores on Cori at NERSC.
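
As background on the time-stepped half of such a hybrid scheme, the sketch below advances a susceptible-infectious-recovered state over a person-to-person contact network in synchronous steps. It is a minimal, sequential illustration only; the state model, parameters, and data layout are assumptions, not Loimos's implementation.

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    enum class State { Susceptible, Infectious, Recovered };

    // One synchronous time step over a contact network given as an adjacency list.
    // Transmission probability and recovery period are illustrative parameters.
    void step(std::vector<State>& people,
              const std::vector<std::vector<int>>& contacts,
              std::vector<int>& daysInfected,
              double pTransmit, int recoveryDays) {
        std::vector<State> next = people;
        for (std::size_t i = 0; i < people.size(); ++i) {
            if (people[i] == State::Infectious) {
                if (++daysInfected[i] >= recoveryDays) next[i] = State::Recovered;
                // Each contact of an infectious person may become infected.
                for (int j : contacts[i])
                    if (people[j] == State::Susceptible &&
                        std::rand() / (RAND_MAX + 1.0) < pTransmit)
                        next[j] = State::Infectious;
            }
        }
        people = next;
    }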

Vector Load Balancing in Charm++

  • Ronak Buch, University of Illinois at Urbana-Champaign

Abstract Load balancing has proven to be critical in achieving scalable, high-performance execution of dynamic scientific applications in HPC. While existing schemes for measuring and performing load balancing are effective for many applications, the increasingly complex structure, heterogeneity, and memory and network constraints of modern applications are often not fully captured or analyzed in the load balancing context, leading to suboptimal balancing decisions. To address this, we have added support for vector load balancing to Charm++, wherein load is a vector of multiple values. We present the additions to the runtime system that underpin this feature, new load balancing strategies that can exploit this higher-granularity data, and the performance improvements it provides to benchmarks and applications.
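
To make the notion of vector loads concrete, the sketch below treats each object's load as a small vector (for example, separate phases or resources) and assigns objects greedily so that the worst per-dimension load across processors stays low. This is an illustrative heuristic in plain C++, not the Charm++ load balancing interface or any of the new strategies; the names and the cost function are assumptions.

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Illustrative only: load per object is a vector, e.g. {cpuPhaseLoad, gpuPhaseLoad}.
    using LoadVec = std::vector<double>;

    // Greedily assign objects to processors, keeping the largest per-dimension
    // load on any processor low. Returns an object -> processor mapping.
    std::vector<int> greedyVectorLB(const std::vector<LoadVec>& objs, int nprocs) {
        int dims = objs.empty() ? 0 : static_cast<int>(objs[0].size());
        std::vector<LoadVec> procLoad(nprocs, LoadVec(dims, 0.0));
        std::vector<int> assignment(objs.size(), 0);

        // Consider heavier objects first (by the sum of their load components).
        std::vector<int> order(objs.size());
        std::iota(order.begin(), order.end(), 0);
        auto total = [&](int i) { double s = 0; for (double v : objs[i]) s += v; return s; };
        std::sort(order.begin(), order.end(), [&](int a, int b) { return total(a) > total(b); });

        for (int i : order) {
            int best = 0;
            double bestCost = 1e300;
            for (int p = 0; p < nprocs; ++p) {
                // Cost = worst dimension on processor p if the object were added there.
                double cost = 0;
                for (int d = 0; d < dims; ++d)
                    cost = std::max(cost, procLoad[p][d] + objs[i][d]);
                if (cost < bestCost) { bestCost = cost; best = p; }
            }
            assignment[i] = best;
            for (int d = 0; d < dims; ++d) procLoad[best][d] += objs[i][d];
        }
        return assignment;
    }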

Support for Charm++ on cloud using Kubernetes

  • Aditya Bhosale, University of Illinois at Urbana-Champaign
  • Kavitha Chandrasekar, University of Illinois at Urbana-Champaign
  • Volodymyr Kindratenko, University of Illinois at Urbana-Champaign
  • Sanjay Kale, University of Illinois at Urbana-Champaign
  • Carlos Costa, IBM
  • Sara Schumacher, IBM
  • Claudia Misale, IBM
  • Pedro Bello-Maldonado, IBM

Abstract In this talk, we will present our work on supporting HPC applications, specifically Charm++ and its applications, on Kubernetes-based clouds. We will describe the operator that allows a user to launch and manage a Charm++ application run on Kubernetes. We will also discuss our recent efforts to support autoscaling with Kubernetes using the shrink-expand feature in Charm++, extending the operator so that nodes/pods can be added to or removed from a job dynamically.

Combining Worksharing and Overdecomposition for Improved Communication-Computation Overlap

  • Michael Robson, Villanova University
  • Laxmikant V. Kale, University of Illinois at Urbana-Champaign

Abstract In this work, we demonstrate the effectiveness of combining the techniques of overdecomposition and worksharing to improve application performance. Our key insight is that tuning these two parameters in combination optimizes performance by facilitating the tradeoff between communication overlap and overhead. We explore this new space of potential optimization by varying both the problem size decomposition (grain size) and the number of cores assigned to execute a particular task (worksharing). Utilizing these two variables in concert, we can shape the execution timeline of applications in order to more smoothly inject messages onto the network, improve cache performance, and decrease the overall execution time of an application. As single-node performance continues to outpace network bandwidth, ensuring smooth and continuous injection of messages into the network will continue to be of crucial importance. Our preliminary results demonstrate a greater than two-fold improvement in performance over a naive OpenMP-only baseline and a thirty percent speedup over the previously best-performing implementation of the same code. This work examines the interaction of these two parameters and their potential to increase application performance.
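
The two knobs can be pictured as follows: the problem is split into more chunks than strictly necessary (overdecomposition, controlling grain size), and each chunk's loop is then shared across a small team of threads (worksharing). The OpenMP-based sketch below exposes both parameters; it is a minimal stand-in for the idea, not the authors' code, and the chunk count, team size, and kernel are illustrative.

    #include <cstddef>
    #include <omp.h>
    #include <vector>

    // Process one overdecomposed chunk, worksharing its loop across a small team.
    void processChunk(std::vector<double>& data, std::size_t begin, std::size_t end,
                      int threadsPerChunk) {
        #pragma omp parallel for num_threads(threadsPerChunk)
        for (long i = static_cast<long>(begin); i < static_cast<long>(end); ++i)
            data[i] = 2.0 * data[i] + 1.0;   // stand-in for real work
    }

    int main() {
        std::vector<double> data(1 << 20, 1.0);
        int numChunks = 16;        // overdecomposition: more chunks than strictly needed
        int threadsPerChunk = 4;   // worksharing: cores cooperating on one chunk
        std::size_t chunkSize = data.size() / numChunks;
        for (int c = 0; c < numChunks; ++c) {
            // In a runtime like Charm++, each chunk would be a separate object whose
            // execution can overlap communication; here they simply run in sequence.
            processChunk(data, c * chunkSize, (c + 1) * chunkSize, threadsPerChunk);
        }
        return 0;
    }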

Applications of Template Metaprogramming based registration to Charm++

  • Aditya Bhosale, University of Illinois at Urbana-Champaign
  • Justin Szaday, University of Illinois at Urbana-Champaign
  • Nikunj Gupta, University of Illinois at Urbana-Champaign

Abstract C++17 provides key improvements to template metaprogramming (TMP). We leverage these features in exploring a Charm++ implementation that does not rely on charmxi for registering chares and their entry methods. In this presentation, we will talk about Charmlite, a 'pure' C++17 solution for Charm++ programming leveraging TMP. We will then discuss CK-TMP, a 'production' implementation of Charmlite supporting pointer-to-offset optimizations via ck::span and macro-based entry method attributes. We have successfully ported LeanMD to this exciting new library!
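
As a rough illustration of the kind of registration C++17 enables without an external translator like charmxi, the sketch below registers a member function as a handler through a template that takes the member-function pointer as an auto non-type template parameter, so a runtime could later invoke it by index when a message arrives. This is not Charmlite's or CK-TMP's API; all names here are hypothetical.

    #include <cstdio>
    #include <functional>
    #include <vector>

    // Illustrative registry: maps a small integer handler id to a callable.
    using Handler = std::function<void(void*)>;
    inline std::vector<Handler>& registry() {
        static std::vector<Handler> r;
        return r;
    }

    // Register a member function as an "entry method" and return its id.
    // The member-function pointer is a non-type template parameter (C++17 auto).
    template <auto Method, typename Obj>
    int registerEntry(Obj& obj) {
        registry().push_back([&obj](void* msg) { (obj.*Method)(msg); });
        return static_cast<int>(registry().size()) - 1;
    }

    struct MyChare {
        void sayHello(void* /*msg*/) { std::puts("hello from an entry method"); }
    };

    int main() {
        MyChare c;
        int id = registerEntry<&MyChare::sayHello, MyChare>(c);
        registry()[id](nullptr);   // a runtime would do this when a message arrives
        return 0;
    }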

Leveraging Heterogeneous Architectures with NAMD

  • David Hardy, University of Illinois at Urbana-Champaign

Abstract NAMD is a parallel molecular dynamics (MD) application capable of simulating very large systems of biomolecules, made highly scalable through Charm++. One of the largest NAMD simulations has been the SARS-CoV-2 coronavirus within a respiratory aerosol droplet, a system totaling over one billion atoms and simulated on 4096 nodes of the ORNL Summit supercomputer. However, the vast majority of MD practitioners simulate much smaller systems, in the range of 50,000 to 1,000,000 atoms. Over the past five years, NAMD has developed a new GPU-resident code path in order to better leverage heterogeneous architectures for these smaller systems. This talk presents the recent developments that have improved NAMD performance and scaling on DGX-like architectures. Upcoming heterogeneous architectures will also be discussed, together with how NAMD and Charm++ could be extended to support them.

Scalable GW software for excited electrons using OpenAtom

  • Kayahan Saritas, Yale University
  • Kavitha Chandrasekar, University of Illinois at Urbana-Champaign
  • Sohrab Ismail-Beigi, Yale University
  • Laxmikant V. Kale, University of Illinois at Urbana-Champaign

Abstract OpenAtom is an open-source, massively parallel software application that performs ab-initio molecular dynamics simulations and ground- and excited-state calculations utilizing a planewave basis set, and relies on the Charm++ runtime system. We describe the status of the excited-state GW implementation in OpenAtom: the GW method is an accurate but computationally expensive method for describing the dynamics of excited electrons in solid-state systems. We will present our progress in implementing an O(N^3) scaling GW method in OpenAtom. In addition to the formalism and physical principles, our data and task parallelization methods and scaling results will be presented.

ExaM2M: Scalable and Adaptive Mesh-to-Mesh Transfer

  • Eric Bohm, Charmworks, Inc.
  • Eric Mikida, Charmworks, Inc.
  • Nitin Bhat, Charmworks, Inc.

Abstract The ExaM2M project was inspired by the desire to model fluid-structure interaction (FSI) for large problems, requiring large meshes distributed across many compute nodes, in the context of Quinoa's Inciter code for fluid dynamics. The general problem of managing solution transfer has many applications and presents a variety of challenges. This talk will cover the basic design of ExaM2M, changes made to improve its performance and utility, and recent results.

Improving Communication Asynchrony and Concurrency in Adaptive MPI

  • Sam White, Charmworks, Inc.

Abstract Thread-based MPI runtimes, which associate private communication contexts or endpoints with each thread, rather than sharing a single context across a multithreaded process, have been proposed as an alternative to MPI’s traditional multithreading models. Adaptive MPI is one such implementation, which also supports overdecomposition and dynamic load balancing on top of Charm++'s runtime. In this talk, we identify and overcome shortcomings in AMPI's support for asynchronous point-to-point communication. We examine the consequences of MPI’s messaging semantics on its runtime and investigate how its design can be improved for applications that do not require the full messaging semantics. We show that the issues for AMPI reduce to similar problems first identified in the context of efficient MPI+X support. Our focus is on enhancing AMPI’s support for asynchrony and concurrency while still optimizing for communication locality through a novel locality-aware message matching scheme. We compare performance with and without the relaxed messaging semantics and our associated optimizations.
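
For background, MPI point-to-point messages are matched against posted receives by (source, tag, communicator) in arrival order, and an endpoint-per-rank design gives every virtual rank its own matching state so co-located endpoints never contend on one shared, process-wide queue. The sketch below is a simplified, single-threaded illustration of that matching logic (wildcards and communicators omitted); it is not AMPI's implementation or its locality-aware scheme.

    #include <deque>
    #include <optional>
    #include <vector>

    // Illustrative matching state for one receiving virtual rank.
    struct Message { int source; int tag; std::vector<char> payload; };
    struct PostedRecv { int source; int tag; };   // MPI_ANY_SOURCE/TAG omitted for brevity

    struct MatchQueues {
        std::deque<Message> unexpected;   // arrived before a matching receive was posted
        std::deque<PostedRecv> posted;    // receives waiting for a matching message

        // Called when a message arrives for this rank.
        std::optional<PostedRecv> onArrival(const Message& m) {
            for (auto it = posted.begin(); it != posted.end(); ++it)
                if (it->source == m.source && it->tag == m.tag) {
                    PostedRecv r = *it; posted.erase(it); return r;   // matched
                }
            unexpected.push_back(m);   // queue until a receive is posted
            return std::nullopt;
        }

        // Called when the application posts a receive.
        std::optional<Message> onPost(const PostedRecv& r) {
            for (auto it = unexpected.begin(); it != unexpected.end(); ++it)
                if (it->source == r.source && it->tag == r.tag) {
                    Message m = *it; unexpected.erase(it); return m;   // matched
                }
            posted.push_back(r);
            return std::nullopt;
        }
    };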

Efficiency Improvements in ADCIRC Coastal Flood Modeling with AMPI Dynamic Load Balancing

  • Dylan Wood, University of Notre Dame
  • Joannes Westerink, University of Notre Dame
  • Damrongsak Wirasaet, University of Notre Dame
  • Laxmikant V. Kale, University of Illinois at Urbana-Champaign
  • Eric Bohm, Charmworks, Inc.

Abstract The ADCIRC model -- based on the shallow-water equations and routinely used in academic, private and official contexts -- currently exhibits considerable computational load imbalance when applied in large-scale parallel simulations to study coastal flooding risks due to hurricane storm surge. For recently developed, advanced finite element unstructured meshes -- which typically cover large geographic regions, spanning up to the entire globe -- the imbalance can exceed 50% as increasing grid resolution is applied in floodplain regions of the model, e.g., areas that are above sea level, since 1) dry areas exert no computational load, 2) hurricane storm surge typically only inundates a small portion of the model, and 3) the software currently implements standard MPI and therefore has no mechanism for decoupling each rank (decomposed portion of the mesh) from each CPU. Despite the increased flood-modeling reliability they make attainable, these meshes are currently restrictive due to their unnecessarily high computational demand, making dynamic load balancing desirable -- especially in time-critical forecasting contexts. We recently developed an AMPI implementation within ADCIRC to address these concerns, using over-decomposition and the virtualization capabilities of AMPI to facilitate the migration of ranks between CPUs as the ratio of dry to wet areas in the model evolves over the course of a flood simulation. AMPI also facilitates future goals for implementing ADCIRC within a larger framework for simulating compound flooding (e.g., flooding from storm surge and tide, from rain, and from overland runoff) by interfacing several different software packages exhibiting highly diverse computational demands.
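
Because AMPI programs are ordinary MPI programs running on migratable virtual ranks, the code change is typically confined to a collective call at an iteration boundary where the runtime may measure load and migrate ranks. The sketch below shows the general shape of such a loop; it is illustrative rather than ADCIRC's code. AMPI_Migrate is the AMPI extension described in the AMPI manual, while the hint keys (left as a comment) and the migration interval are assumptions.

    #include <mpi.h>

    // Illustrative AMPI-style time loop (not ADCIRC's code). Built against AMPI,
    // each MPI "rank" is a migratable virtual processor.
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        MPI_Info hints;
        MPI_Info_create(&hints);
        // AMPI-specific hint keys (e.g. selecting synchronous load balancing)
        // would be set here; see the AMPI manual for the keys in your version.

        for (int step = 0; step < 1000; ++step) {
            // ... advance the (here omitted) shallow-water solution one step ...

            if (step % 100 == 0) {
                // Collective point at which the runtime may measure load and
                // migrate virtual ranks between CPUs (AMPI extension).
                AMPI_Migrate(hints);
            }
        }

        MPI_Info_free(&hints);
        MPI_Finalize();
        return 0;
    }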

Quantifying Overheads in Charm++ and HPX using Task Bench

  • Ioannis Gonidelis, Center of Computation and Technology, LSU
  • Patrick Diehl, Center of Computation and Technology, LSU

Abstract Asynchronous Many-Task (AMT) runtime systems take advantage of multi-core architectures with light-weight threads, asynchronous executions, and smart scheduling. In this talk, we present a comparison of the AMT systems Charm++ and HPX with the mainstream MPI, OpenMP, and MPI+OpenMP libraries using the Task Bench benchmarks. Charm++ is a parallel programming language based on C++, supporting stackless tasks as well as light-weight threads, along with an adaptive runtime system. HPX is a C++ library for concurrency and parallelism, exposing a C++ standards-conforming API. First, we analyze the commonalities, differences, and advantageous scenarios of Charm++ and HPX in detail. Further, to investigate the potential overheads introduced by the tasking systems of Charm++ and HPX, we utilize an existing parameterized benchmark, Task Bench, in which 15 different programming systems were implemented, e.g., MPI, OpenMP, and MPI + OpenMP, and extend Task Bench by adding HPX implementations. We quantify the overheads of Charm++, HPX, and the mainstream libraries in scenarios where a single task and where multiple tasks are assigned to each core, respectively. We also investigate each system's scalability and ability to hide communication latency.
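
The quantity being compared can be thought of as the wall-clock cost per task when the task body does essentially nothing. As a crude point of reference (far simpler than Task Bench's parameterized task graphs, and not its methodology), the sketch below times empty C++ standard-library tasks and reports the per-task overhead.

    #include <chrono>
    #include <cstdio>
    #include <future>

    int main() {
        const int numTasks = 1000;
        auto t0 = std::chrono::steady_clock::now();

        // Launch and immediately wait on empty tasks: the elapsed time is
        // (almost) entirely tasking/threading overhead, not useful work.
        for (int i = 0; i < numTasks; ++i)
            std::async(std::launch::async, [] {}).wait();

        auto t1 = std::chrono::steady_clock::now();
        double usPerTask =
            std::chrono::duration<double, std::micro>(t1 - t0).count() / numTasks;
        std::printf("roughly %.2f us of overhead per empty task\n", usPerTask);
        return 0;
    }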

A Performance Evaluation of Adaptive MPI for a Particle-In-Cell Code

  • Christian Asch, Costa Rica High Technology Center - Advanced Computing National Laboratory
  • Diego Jimenez, Costa Rica High Technology Center - Advanced Computing National Laboratory
  • Markus Rampp, Max Planck Computing and Data Facility (MPCDF)
  • Erwin Laure, Max Planck Computing and Data Facility (MPCDF)
  • Esteban Meneses, Costa Rica High Technology Center - Advanced Computing National Laboratory

Abstract In the quest for extreme-scale supercomputers, the High Performance Computing (HPC) community has developed many resources (programming paradigms, architectures, methodologies, numerical methods) to face the multiple challenges along the way. One of those resources is task-based parallel programming tools. The availability of mature programming models, programming languages, and runtime systems that use task-based parallelism represents a favorable ecosystem. The fundamental premise of these tools is their ability to naturally cope with dynamically changing execution conditions, i.e. adaptivity. In this paper, we explore Adaptive MPI, a parallel-object framework, as a mechanism to provide, among other features, automatic and dynamic load balancing for a particle-in-cell application. We ported a pre-existing MPI application to the Adaptive MPI infrastructure and highlight the changes required to the code. Our experimental results show that Adaptive MPI has minimal overhead, maintains the scalability of the original code, and is able to alleviate an artificially-introduced load imbalance.

Extreme-scale GPU Computational Fluid Dynamics with AMR

  • Joshua Davis, University of Maryland, College Park
  • Justin Shafner, University of Maryland, College Park
  • Daniel Nichols, University of Maryland, College Park
  • Nathan Grube, University of Maryland, College Park
  • Pino Martin, University of Maryland, College Park
  • Abhinav Bhatele, University of Maryland, College Park

Abstract Accurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, and applies to atmospheric flight, supersonic combustion, materials discovery and climate prediction. In this talk, we describe our experiences in extending the capabilities of and modernizing CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using a highly-scalable AMR library, AMReX, and add support for a fully curvilinear solver. We also share our experiences in porting the computational kernels in CRoCCo to NVIDIA GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo-AMR, on the Summit system, demonstrating a 5× to 24× speedup over the CPU-only version.

Runtime adaptivity and load balancing in the AMReX framework

  • Andrew Myers, Lawrence Berkeley National Laboratory

Abstract AMReX is an exascale-ready software framework for building massively parallel block-structured adaptive mesh refinement applications. It forms the basis for the spatial and temporal discretizations of a wide variety of scientific codes, spanning subject areas such as astrophysics and cosmology, combustion research, wind farm modeling, particle accelerators, and more. AMReX uses an "MPI+X" programming model, where "X" is either one of CUDA, HIP, or DPC++ for GPU execution or OpenMP for many-core CPUs. In this talk, I will give an overview of AMReX and its associated ecosystem of application codes, with a particular focus on the tools provided to support runtime adaptivity and dynamic load balancing for multiphysics problems with heterogeneous workloads.

CharmTyles: Large-scale Interactive Charm++ with Python

  • Aditya Bhosale, University of Illinois at Urbana-Champaign
  • Nikunj Gupta, University of Illinois at Urbana-Champaign
  • Zane Fink, University of Illinois at Urbana-Champaign
  • Aryan Sharma, Indian Institute of Technology, Kanpur

Abstract Python has become a popular choice of language for applications ranging from scientific computing to machine learning in recent years. Libraries such as numpy and scipy have made it possible to achieve high performance while maintaining productivity on a single node. Frameworks such as Charm4Py and mpi4py have helped achieve high performance on distributed systems, but lack the interactivity that tools like Jupyter notebooks provide. CharmTyles aims to provide a set of abstractions built on a client-server model, with a Python frontend and a Charm++ server on the backend, to maintain interactivity while still achieving good performance. In this talk, we present a dense linear algebra abstraction and a stencil computation abstraction based on this programming model.

A Productive and Scalable Actor-Based Programming System for PGAS Applications

  • Sri Raj Paul, Intel Corporation
  • Akihiro Hayashi, Georgia Institute of Technology
  • Kun Chen, Georgia Institute of Technology
  • Youssef W Elmougy, Georgia Institute of Technology
  • Vivek Sarkar, Georgia Institute of Technology

Abstract The Partitioned Global Address Space (PGAS) model is well suited to irregular applications due to its support for short, one-sided messages. However, there are currently two major limitations faced by PGAS applications. The first relates to scalability: despite the availability of APIs that support non-blocking operations in special cases, many PGAS operations on remote locations are synchronous by default, which can lead to long-latency stalls and poor scalability. The second relates to productivity: while it is simpler for the developer to express all communication at the fine granularity that is natural to the application, experiments have shown that such a natural expression results in performance that is 20 times slower than less productive code that requires manual message aggregation and termination detection. The actor model has been gaining popularity as a productive asynchronous message-passing approach for distributed objects in enterprise and cloud computing platforms, typically implemented in languages such as Erlang, Scala or Rust. To the best of our knowledge, there has been no past work on using the actor model to deliver both productivity and scalability to PGAS applications. In this work, we introduce a new programming system for PGAS applications, in which all remote operations are expressed as fine-grained asynchronous actor messages. In this approach, the programmer does not need to worry about programming complexities related to message aggregation and termination detection. Thus, our approach offers a desirable point in the productivity-performance space for PGAS applications, with both scalable performance and higher productivity relative to past approaches.
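
One of the productivity burdens mentioned above, manual message aggregation, can be pictured with the small sketch below: fine-grained messages bound for the same destination are buffered and forwarded as a single batch once a threshold is reached. This is a generic illustration of the mechanism the proposed system automates on the programmer's behalf, not its API; the class and callback are hypothetical.

    #include <cstdio>
    #include <functional>
    #include <vector>

    // Illustrative aggregating mailbox: fine-grained actor messages to one
    // destination are buffered and sent as a single batch.
    template <typename Msg>
    class AggregatingMailbox {
    public:
        AggregatingMailbox(std::size_t threshold,
                           std::function<void(std::vector<Msg>&&)> sendBatch)
            : threshold_(threshold), sendBatch_(std::move(sendBatch)) {}

        void send(Msg m) {
            buffer_.push_back(std::move(m));
            if (buffer_.size() >= threshold_) flush();
        }

        // Must also be called at termination so no messages are left behind.
        void flush() {
            if (!buffer_.empty()) sendBatch_(std::move(buffer_));
            buffer_.clear();
        }

    private:
        std::size_t threshold_;
        std::function<void(std::vector<Msg>&&)> sendBatch_;
        std::vector<Msg> buffer_;
    };

    int main() {
        // Toy usage: batch integer messages in groups of 4.
        AggregatingMailbox<int> box(4, [](std::vector<int>&& batch) {
            std::printf("sending batch of %zu messages\n", batch.size());
        });
        for (int i = 0; i < 10; ++i) box.send(i);
        box.flush();   // sends the 2 leftover messages
        return 0;
    }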

The OpenMP Cluster Programming Model

  • Hervé Yviquel, UNICAMP
  • Guilherme Valarini, UNICAMP
  • Marcio Pereira, UNICAMP
  • Guido Araujo, UNICAMP

Abstract Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve this, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.
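
In OpenMP terms, the annotated regions are ordinary target tasks ordered by depend clauses; under OMPC these could be distributed across nodes, while a standard OpenMP compiler would run them locally or on an attached device. The sketch below shows the general shape of such code under that assumption; array names, sizes, and kernels are made up, and real kernels would add teams/parallel-for constructs.

    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
        double* pa = a.data(); double* pb = b.data(); double* pc = c.data();

        #pragma omp parallel
        #pragma omp single
        {
            // Two independent annotated regions (target tasks)...
            #pragma omp target nowait map(tofrom: pa[0:n]) depend(out: pa[0:n])
            for (int i = 0; i < n; ++i) pa[i] *= 3.0;

            #pragma omp target nowait map(tofrom: pb[0:n]) depend(out: pb[0:n])
            for (int i = 0; i < n; ++i) pb[i] += 1.0;

            // ...and a consumer scheduled only after both dependencies are met.
            #pragma omp target nowait map(to: pa[0:n], pb[0:n]) map(from: pc[0:n]) \
                    depend(in: pa[0:n], pb[0:n]) depend(out: pc[0:n])
            for (int i = 0; i < n; ++i) pc[i] = pa[i] + pb[i];

            #pragma omp taskwait   // wait for the three target tasks
        }
        return 0;
    }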

Enzo-E / Cello: Recent advances and future directions

  • James Bordner, University of California, San Diego
  • Michael Norman, University of California, San Diego

Abstract Enzo-E is a numerical astrophysics and cosmology application built on the Cello scalable adaptive mesh refinement framework. We will provide a review of the application and framework; discuss recent and ongoing work in improving I/O, adding ML support, and implementing a scalable gravity solver; and outline future efforts in implementing adaptive time-stepping.

LBAF-Viz: A New Application and Library to Visualize Computational Load and Communication Graphs

  • Nicolas Morales, Sandia National Labs
  • Philippe Pébaÿ, NexGen Analytics
  • Marcin Wróbel, NexGen Analytics
  • Nicole Slattengren, Sandia National Labs
  • Jonathan Lifflander, Sandia National Labs

Abstract Reducing or even eliminating load imbalance is not the only important aspect of improving the scalability and overall execution speed of distributed applications. Indeed, inter-task and in particular off-node inter-task communication can also be a severe bottleneck slowing down applications. In order to better comprehend the respective and combined contributions of computational load and communication weight to overall performance, we have developed an analysis tool that can produce both still renderings and mesh-based data apt for interactive visualization, e.g. using ParaView. Furthermore, in addition to visualization capabilities, this tool also provides a general framework to simulate the behavior of load and communication balancing algorithms, which is both easy to use and to extend thanks to its modular design implemented in an open-source Python library. The goal of this presentation is therefore two-fold: first, to explain the general methodology utilized by this visualization and analysis framework and how it relates to existing load-balancing algorithms; and second, to offer a demonstration of the visualization tool using load and communication data output by actual simulation runs in a task-distributed setting using DARMA/vt. The latter, in particular, is designed to be interactive with the audience as well as to foster discussion regarding the benefits of this tool for performance optimization of distributed applications.

Scalable Heterogeneous Computing with Asynchronous Message-Driven Execution

  • Jaemin Choi, NVIDIA Corporation

Abstract Computer systems today are becoming increasingly heterogeneous, in response to rapidly rising performance requirements of both traditional and emerging workloads including computational science, data science, and machine learning, pushing the limits of power and energy imposed by the silicon. Asynchronous message-driven execution, as realized in the Charm++ parallel programming system, is an emerging model that has been proven to be effective in traditional CPU-based systems and large-scale parallel execution due to its adaptive features such as automatic computation-communication overlap and dynamic load balancing. However, when applied to modern heterogeneous and GPU-accelerated systems, asynchronous message-driven execution presents many challenges in realizing overdecomposition and asynchronous progress with low overhead and minimal synchronization between the host and device as well as between the work units executing in parallel. We analyze the issues in realizing efficient asynchronous message-driven execution on modern heterogeneous systems and introduce new capabilities and approaches to address them in the form of runtime support in the Charm++ parallel programming system.

Dynamic load balancing strategies in contact applications

  • Nicolas Morales, Sandia National Labs
  • Ulrich Hetmaniuk, NexGen Analytics
  • Jonathan Lifflander, Sandia National Labs
  • Reese Jones, Sandia National Labs

Abstract Contact search and enforcement is a perpetually challenging problem in solid mechanics applications. Anecdotally, it is not uncommon to see over half the total computation time of a solid mechanics simulation devoted to contact. More robust contact schemes (often involving multiple sub-iterations of search and enforcement) are generally more expensive. Simple parallelization of the solid mechanics application, however, is not sufficient to mitigate the performance cost of contact. In practice, the computational load of contact routines can differ greatly from the load characteristics of the non-contact portion of the simulation. This talk will examine the load characteristics of contact problems using the NimbleSM solid mechanics solver and demonstrate an asynchronous many-task (AMT) approach using a dynamic load balancer that can address many of the performance challenges related to these contact problems. Our approach uses the DARMA/vt AMT runtime framework for asynchronous task scheduling and runtime load balancing for the contact routines; the non-contact portions of the application use a traditional MPI communication pattern. In this talk, we will also present experimental results on a variety of contact problems using our implementation. SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525.