CHIUW 2019: Proceedings of the ACM SIGPLAN 6th Chapel Implementers and Users Workshop


SESSION: Keynote

Programming abstractions for orchestration of HPC scientific computing (keynote)

Application developers are confronted with three axes of increasing complexity going forward: increasing heterogeneity in computing platforms at all levels, increasing heterogeneity in solvers and data management, and moving existing code bases to future programming models. While the first two will dictate which future programming models may deliver the needed performance, the third will determine their adoption. However, it is clear that the infrastructure backbone of large-scale multiphysics software has to orchestrate data and task movement between devices. The lifecycle of scientific software is several times that of platforms; therefore, any orchestration mechanism must be flexible and configurable enough to remain usable on future platforms. In this presentation I will outline a model of an orchestration framework and the demands that it will place on programming models and languages.

SESSION: Chapel Implementation Improvements

GPUIterator: bridging the gap between Chapel and GPU platforms

PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node.

In this paper, we address some of the key limitations of past approaches to mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial for specific platforms.

Our preliminary performance evaluations show that the GPUIterator is a promising way for Chapel programmers to easily utilize one or more CPU+GPU nodes while maintaining portability.

Calling Chapel code: interoperability improvements

Since CHIUW last year, the Chapel team has undertaken an effort to improve the ability to call Chapel code from other languages. This talk will cover a few areas of improvement: using Chapel code as a library from C, Python, and Fortran, as well as improvements to array interoperation.

SESSION: Chapel Performance and Optimization

Towards radix sorting in the Chapel standard library

This talk will discuss recent work improving the Sort module of the Chapel programming language. It will discuss an interface design to support radix sort, describe the implementation of radix sort, compare the performance of this implementation to sort libraries in other languages, and finally discuss distributed sorting.
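For readers unfamiliar with the technique, the following is a minimal Python sketch of a least-significant-digit radix sort. It is not the Chapel Sort module's interface or implementation; the function name `radix_sort` and the parameters `key` and `key_bytes` are hypothetical, and the sketch assumes every key is a non-negative integer smaller than 2**(8 * key_bytes).

```python
def radix_sort(values, key=lambda v: v, key_bytes=8):
    """Sort items by a non-negative integer key, one byte per pass.

    Assumption: key(item) is an int in the range [0, 2**(8 * key_bytes)).
    """
    items = list(values)
    # Process the key from its least significant byte to its most significant.
    for shift in range(0, key_bytes * 8, 8):
        buckets = [[] for _ in range(256)]
        for item in items:
            digit = (key(item) >> shift) & 0xFF
            buckets[digit].append(item)
        # Concatenating buckets in order keeps each pass stable,
        # which is what makes the multi-pass scheme correct.
        items = [item for bucket in buckets for item in bucket]
    return items
```

Because each pass is stable, records with equal keys keep their input order, so the `key` argument can be used to sort records by one integer field.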

Implementing stencil problems in Chapel: an experience report

Stencil operations represent a fundamental class of algorithms in high-performance computing. We are interested in what level of performance can be expected from a high-productivity language such as Chapel. To this end, we discuss four different implementations of a generic stencil operation with a convergence check after each iteration. We start with a sequential implementation, followed by a global-view implementation that we experiment with both on a 16-core multi-core system and, using domain maps, on a cluster with up to 16 such nodes. We finish with a local-view implementation that explicitly encodes all design decisions with respect to parallel execution. This paper is set up as a two-stage experience report: we mainly report our findings from the users' perspective without any feedback from the Chapel implementers. We then report additional analysis performed under the guidance of the Chapel team.

Our experimental findings show that Chapel performs as expected on a single node. However, it does not achieve the expected levels of performance on our multi-node system, either with the data-parallel global-view approach or with the task-parallel local-view code. We discuss the root causes of our reduced performance in detail and report possible solutions.
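To illustrate what a stencil operation with a convergence check after each iteration looks like, here is a minimal sequential sketch in Python. It assumes a 5-point Jacobi averaging stencil on a 2D grid with fixed boundary values; the abstract does not say which stencil the paper uses, and the name `jacobi` and the parameters `tol` and `max_iters` are hypothetical.

```python
def jacobi(grid, tol=1e-4, max_iters=10_000):
    """Repeat a 5-point averaging stencil until the largest change is below tol.

    Assumption: grid is a rectangular list of lists of floats whose outer
    rows and columns hold fixed boundary values.
    Returns the final grid and the number of iterations performed.
    """
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]
    for iteration in range(1, max_iters + 1):
        nxt = [row[:] for row in current]
        delta = 0.0
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                # Each interior point becomes the mean of its four neighbours.
                nxt[i][j] = 0.25 * (current[i - 1][j] + current[i + 1][j]
                                    + current[i][j - 1] + current[i][j + 1])
                delta = max(delta, abs(nxt[i][j] - current[i][j]))
        current = nxt
        # Convergence check after each iteration.
        if delta < tol:
            return current, iteration
    return current, max_iters
```

In a parallel version the convergence check becomes a global reduction over `delta`, which is one place where a distributed implementation pays a communication cost on every iteration.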

Chapel unblocked: recent communication optimizations in Chapel

This talk will highlight communication optimizations made to the Chapel compiler and runtime over the past year. It will focus on improvements to core benchmarks that have benefited from fine-grained and bulk communication optimizations as well as remote task-spawning improvements. Several benchmarks, including HPC Challenge (HPCC) RandomAccess, HPCC Stream Triad, and the integer sort code ISx, will be briefly introduced, and a relevant performance optimization will be showcased for each. These benchmarks represent core idioms that are common in many HPC applications. Performance results on up to 1,024 nodes (25,000 cores) will demonstrate that with each release Chapel is becoming more competitive against hand-tuned MPI+OpenMP, SHMEM, and UPC.

SESSION: Applications of Chapel

Arkouda: interactive data exploration backed by Chapel

Exploratory data analysis (EDA) is the prerequisite for all data science. EDA is non-negotiably interactive—by far the most popular environment for EDA is a Jupyter notebook—and, as datasets grow, increasingly computationally intensive. Several existing projects attempt to combine interactivity and distributed computation using programming paradigms and tools from cloud computing, but none of these projects have come close to meeting our needs for high-performance EDA. To fill this gap, we have developed a prototype, called arkouda, that allows a user to interactively issue massively parallel computations on distributed data.

Chapel graph library (CGL)

In this talk, I summarize prior work on the Chapel HyperGraph Library (CHGL) and the Chapel Aggregation Library (CAL), and introduce the more general Chapel Graph Library (CGL). CGL is being designed to enable global-view programming, such that locality is abstracted from the user. CGL is also being designed in a way that is similar to Chapel's multiresolution design philosophy, where graphs are implemented in terms of hypergraphs, and where both the underlying hypergraph and overlying graphs are available for use. Some of the kinds of graphs being designed are bipartite graphs, directed and undirected graphs, and even trees.

Chapel in Cray HPO

Cray HPO is a component of the data science workflow framework, CrayAI, which provides functionality for Hyperparameter Optimization (HPO) at scale. Machine learning models are commonly defined by a set of hyperparameters that control aspects such as learning rate, depth, or kernel functions, which have a large impact on the quality of the trained model. These hyperparameters require tuning in applications where the hyperparameter space is large and model quality is a priority. Cray HPO provides three approaches: the more traditional random and grid searches, as well as a genetic-based optimization technique. These are made available through a Python interface made possible by Chapel's recently developed Python-interoperability features, and the underlying implementations were built in Chapel.
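The two traditional strategies named above can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the Cray HPO interface: the functions `grid_search` and `random_search`, the dictionary-based search space, and the convention that a lower objective score is better are all hypothetical. The genetic approach is not shown.

```python
import itertools
import random


def grid_search(objective, space):
    """Evaluate every combination in the space and return (best_score, params).

    Assumption: space maps each hyperparameter name to a list of candidate
    values, and a lower objective score is better.
    """
    names = list(space)
    best = None
    for combo in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, combo))
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


def random_search(objective, space, trials, seed=0):
    """Evaluate `trials` randomly sampled combinations and return the best."""
    rng = random.Random(seed)  # seeded so that a run is reproducible
    best = None
    for _ in range(trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best
```

Grid search costs the product of the candidate counts, so it becomes impractical as the space grows; random search fixes the budget at `trials` evaluations and gives up the guarantee of finding the best grid point. Every evaluation is independent in both strategies, which is what makes them straightforward to run in parallel.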