PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Full Citation in the ACM Digital Library

Kite: efficient and available release consistency for the datacenter

Key-Value Stores (KVSs) came into prominence as highly-available, eventually consistent (EC), "NoSQL" Databases, but have quickly transformed into general-purpose, programmable storage systems. Thus, EC, while relevant, is no longer sufficient. Complying with the emerging requirements for stronger consistency, researchers have proposed KVSs with multiple consistency levels (MCL) that expose the consistency/performance trade-off to the programmer. We argue that this approach falls short in both programmability and performance. For instance, the MCL APIs proposed thus far, fail to capture the ordering relationship between strongly- and weakly-consistent accesses that naturally occur in programs.

Taking inspiration from shared memory, we advocate Release Consistency (RC) for KVSs. We argue that RC's onesided barriers are ideal for capturing the ordering relationship between synchronization and non-synchronization accesses while enabling high-performance.

We present Kite, the first highly-available, replicated KVS that offers a linearizable variant of RC for the asynchronous setting with individual process and network failures. Kite enforces RC barriers through a novel fast/slow path mechanism that leverages the absence of failures in the typical case to maximize performance while relying on the slow path for progress. Our evaluation shows that the RDMA-enabled and heavily-multithreaded Kite achieves orders of magnitude better performance than Derecho (a state-of-the-art RDMA-enabled state machine replication system) and significantly outperforms ZAB (the protocol at the heart of Zookeeper). We demonstrate the efficacy of Kite by porting three lock-free shared memory data structures, and showing that Kite outperforms the competition.

Oak: a scalable off-heap allocated key-value map

Efficient ordered in-memory key-value (KV-)maps are paramount for the scalability of modern data platforms. In managed languages like Java, KV-maps face unique challenges due to the high overhead of garbage collection (GC).

We present Oak, a scalable concurrent KV-map for environments with managed memory. Oak offloads data from the managed heap, thereby reducing GC overheads and improving memory utilization. An important consideration in this context is the programming model since a standard object-based API entails moving data between the on- and off-heap spaces. In order to avoid the cost associated with such movement, we introduce a novel zero-copy (ZC) API. It provides atomic get, put, remove, and various conditional put operations such as compute (in-situ update).

We have released an open-source Java version of Oak. We further present a prototype Oak-based implementation of the internal multidimensional index in Apache Druid. Our experiments show that Oak is often 2x faster than Java's state-of-the-art concurrent skiplist.

Optimizing batched winograd convolution on GPUs

In this paper, we present an optimized implementation for single-precision Winograd convolution on NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd convolution in cuDNN 7.6.1, our implementation achieves up to 2.13X speedup on Volta V100 and up to 2.65X speedup on Turing RTX2070. On both Volta and Turing GPUs, our implementation achieves up to 93% of device peak.

Apart from analyzing and benchmarking different high-level optimization options, we also build a SASS assembler TuringAs for Volta and Turing that enables tuning the performance at the native assembly level. The new optimization opportunities uncovered by TuringAs not only improve the Winograd convolution but can also benefit CUDA compilers and native assembly programming. We have released TuringAs as an open-source software. To the best of our knowledge, this is the first public-available assembler for Volta and Turing GPUs.

Taming unbalanced training workloads in deep learning with partial collective operations

Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experiments are conducted on a variety of neural networks and datasets. The results on load-imbalanced environments show that eager-SGD achieves 2.64 X speedup (ResNet-50 on ImageNet) over the asynchronous centralized SGD, and achieves 1.29 X speedup (ResNet-50 on ImageNet) and 1.27X speedup (LSTM on UCF101) over the state-of-the-art synchronous decentralized SGDs, without losing accuracy.

Scalable top-k retrieval with Sparta

Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a document's score is the sum of its scores for all query terms. Top-k retrieval is often used to sift through massive data and identify a smaller subset of it for further analysis. Because it filters out the bulk of the data, it often constitutes the main performance bottleneck.

Beyond the rise in data sizes, today's data processing scenarios also increase the number of features contributing to the overall score. In web search, for example, verbose queries are becoming mainstream, while state-of-the-art algorithms fail to process long queries in real-time.

We present Sparta, a practical parallel algorithm that exploits multi-core hardware for fast (approximate) top-k retrieval. Thanks to lightweight coordination and judicious context sharing among threads, Sparta scales both in the number of features and in the searched index size. In our web search case study on 50M documents, Sparta processes 12-term queries more than twice as fast as the state-of-the-art. On a tenfold bigger index, Sparta processes queries at the same speed, whereas the average latency of existing algorithms soars to be an order-of-magnitude larger than Sparta's.

waveSZ: a hardware-algorithm co-design of efficient lossy compression for scientific data

Error-bounded lossy compression is critical to the success of extreme-scale scientific research because of ever-increasing volumes of data produced by today's high-performance computing (HPC) applications. Not only can error-controlled lossy compressors significantly reduce the I/O and storage burden but they can retain high data fidelity for post analysis. Existing state-of-the-art lossy compressors, however, generally suffer from relatively low compression and decompression throughput (up to hundreds of megabytes per second on a single CPU core), which considerably restrict the adoption of lossy compression by many HPC applications especially those with a fairly high data production rate. In this paper, we propose a highly efficient lossy compression approach based on field programmable gate arrays (FPGAs) under the state-of-the-art lossy compression model SZ. Our contributions are fourfold. (1) We adopt a wavefront memory layout to alleviate the data dependency during the prediction for higher-dimensional predictors, such as the Lorenzo predictor. (2) We propose a co-design framework named waveSZ based on the wavefront memory layout and the characteristics of SZ algorithm and carefully implement it by using high-level synthesis. (3) We propose a hardware-algorithm co-optimization method to improve the performance. (4) We evaluate our proposed waveSZ on three real-world HPC simulation datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs and FPGAs. Experiments show that our waveSZ can improve SZ's compression throughput by 6.9X ~ 8.7X over the production version running on a state-of-the-art CPU and improve the compression ratio and throughput by 2.1X and 5.8X on average, respectively, compared with the state-of-the-art FPGA design.

Scaling concurrent queues by using HTM to profit from failed atomic operations

Queues are fundamental concurrent data structures, but despite years of research, even the state-of-the-art queues scale poorly. This poor scalability occurs because of contended atomic read-modify-write (RMW) operations.

This paper makes a first step towards designing a scalable linearizable queue. We leverage hardware transactional memory (HTM) to design TxCAS, a scalable compare-and-set (CAS) primitive---despite HTM being targeted mainly at uncontended scenarios.

Leveraging TxCAS's scalability requires a queue design that does not blindly retry failed CASs. We thus apply TxCAS to the baskets queue, which steers enqueuers whose CAS fails into dedicated basket data structures. Coupled with a new, scalable basket algorithm, we obtain SBQ, the scalable baskets queue. At high concurrency levels, SBQ outperforms the fastest queue today by 1.6X on a producer-only workload.

A wait-free universal construction for large objects

Concurrency has been a subject of study for more than 50 years. Still, many developers struggle to adapt their sequential code to be accessed concurrently. This need has pushed for generic solutions and specific concurrent data structures.

Wait-free universal constructions are attractive as they can turn a sequential implementation of any object into an equivalent, yet concurrent and wait-free, implementation. Existing universal constructions either maintain a per-thread replica of the object, or make a new copy of the object on every operation, with both approaches yielding low performance.

To overcome these limitations, we have designed CX, a multi-instance-based wait-free universal construction that shares replicas among threads using reader-writer locks.

CX performs significantly better than existing wait-free constructions for our tested sequential data structures. Moreover, on read-oriented workloads, CX rivals hand-written lock-free and wait-free data structures. Our universal construction is the first to provide integrated wait-free memory reclamation.

Fast concurrent data sketches

Data sketches are approximate succinct summaries of long data streams. They are widely used for processing massive amounts of data and answering statistical queries about it. Existing libraries producing sketches are very fast, but do not allow parallelism for creating sketches using multiple threads or querying them while they are being built. We present a generic approach to parallelising data sketches efficiently and allowing them to be queried in real time, while bounding the error that such parallelism introduces. Utilising relaxed semantics and the notion of strong linearisability we prove our algorithm's correctness and analyse the error it induces in two specific sketches. Our implementation achieves high scalability while keeping the error small. We have contributed one of our concurrent sketches to the open-source data sketches library.

Universal wait-free memory reclamation

In this paper, we present a universal memory reclamation scheme, Wait-Free Eras (WFE), for deleted memory blocks in wait-free concurrent data structures. WFE's key innovation is that it is completely wait-free. Although some prior techniques provide similar guarantees for certain data structures, they lack support for arbitrary wait-free data structures. Consequently, developers are typically forced to marry their wait-free data structures with lock-free Hazard Pointers or (potentially blocking) epoch-based memory reclamation. Since both these schemes provide weaker progress guarantees, they essentially forfeit the strong progress guarantee of wait-free data structures. Though making the original Hazard Pointers scheme or epoch-based reclamation completely wait-free seems infeasible, we achieved this goal with a more recent, (lock-free) Hazard Eras scheme, which we extend to guarantee wait-freedom. As this extension is non-trivial, we discuss all challenges pertaining to the construction of universal wait-free memory reclamation.

WFE is implementable on ubiquitous x86_64 and AArch64 (ARM) architectures. Its API is mostly compatible with Hazard Pointers, which allows easy transitioning of existing data structures into WFE. Our experimental evaluations show that WFE's performance is close to epoch-based reclamation and almost matches the original Hazard Eras scheme, while providing the stronger wait-free progress guarantee.

Using sample-based time series data for automated diagnosis of scalability losses in parallel programs

The performance of many parallel applications has failed to scale as fast as successive generations of hardware on which these applications execute. To understand the cause of scalability losses, experts use performance tools to monitor and analyze application behavior. Profiles generated by performance tools can usually indicate the presence of scalability losses while time series data are generally necessary to pinpoint the root causes of such losses. However, manual analysis of time series data can be difficult in executions with a large number of processes, long running times, and deep call chains. This paper describes an automated framework that analyzes sample-based time series data to diagnose scalability losses in parallel executions. The framework's automated diagnosis of scalability losses indicates their symptoms, severity, and causes. Two case studies illustrate the effectiveness of this framework. When compared to a tool that analyzes performance using instrumentation-based traces, our overhead for collecting sample-based time series is 1/28 in time and 1/1600 in space while our automated analysis takes 1/25 of the time.

Scaling out speculative execution of finite-state machines with parallel merge

A finite-state machine (FSM) is a key component for many important applications, such as Huffman decoding, regular expression matching and HTML tokenization. Due to its inherent dependencies and unpredictable memory access pattern, FSM computations are considered to be extremely difficult to parallelize. As such, significant research efforts have been made to accelerate FSM computations. Although they achieve promising performance results on multi-core machines, these methods are not scalable for emerging many-core architectures such as the GPUs.

Based on our experiments, we point out that the bottleneck of achieving scalability on GPUs is the sequential merge inherent to these methods. However, unlike the case for simple reduction loops, parallel merge implementations for FSM computations typically require runtime checks and re-executions, which can also impede performance. Based on these observations, we develop parallel merge techniques that select efficient runtime check implementations and avoids unnecessary re-executions. Further, based on GPU architectural features, we develop optimization techniques to improve performance.

We evaluate our parallel merge implementations on a set of representative algorithms. Experimental results show that our parallel merge implementations are 2.02-6.74 times more efficient than corresponding sequential merge implementations and achieve better scalability on an Nvidia V100 GPU.

On the fly MHP analysis

May-Happen-in-Parallel (MHP) analysis forms the basis for many problems of program analysis and program understanding. MHP analysis can also be used by IDEs (integrated-development-environments) to help programmers to refactor parallel-programs, identify racy programs, understand which parts of the program run in parallel, and so on. Since the code keeps changing in the IDE, re-computing the MHP information after every change can be an expensive affair. In this manuscript, we propose a novel scheme to perform incremental MHP analysis (on the fly) of programs written in task parallel languages like X10 to keep the MHP information up to date, in an IDE environment.

The key insight of our proposed approach to maintain the MHP information up to date is that we need not rebuild (from scratch) every data structure related to MHP information, after each modification (addition or deletion of statements) in the source code. The idea is to reuse the old MHP information as much as possible and incrementally recompute the MHP information (of a small set of statements) which depends on the statement added/removed. We introduce two new algorithms that deal with addition and removal of parallel constructs like finish, async, atomic, and sequential constructs like loop, if, if-else and other sequential statements, on the fly. Our evaluation shows that our algorithms run much faster than the repeated invocations of the fastest known MHP analysis for X10 programs [Sankar et al. 2016].

Detecting and reproducing error-code propagation bugs in MPI implementations

We present an approach to automatically detect and reproduce error code propagation bugs in MPI implementations. Specifically, we combine static analysis and program repair for bug detection, and apply fault injection to reproduce error propagation bugs found in MPI libraries written in C. We demonstrate our approach on the MPICH library, one of the most popular implementations of MPI, and the MPICH-based implementation MVAPICH, uncovering 447 previously unknown bugs. We discovered that 31 of these bugs result in program crashes, and 60% of the MPICH test suite is susceptible to crashing due to failures to propagate error codes. Moreover, 95 bugs produce undesirable behavior that has been confirmed dynamically, causing tests to fail, hanging processes, or simply dropping error codes before reaching user applications.

Parallel and distributed bounded model checking of multi-threaded programs

We introduce a structure-aware parallel technique for context-bounded analysis of concurrent programs. The key intuition consists in decomposing the set of concurrent traces into symbolic subsets that are separately explored by multiple instances of the same decision procedure running in parallel. The decision procedures work on different partitions of the search space without cooperating, whence distribution follows effortlessly. Our experiments on a selection of complex multi-threaded programs show significant analysis speedups and scalability, and greater performance gains than with general-purpose parallel solvers.

Parallel determinacy race detection for futures

The use of futures can generate arbitrary dependences in the computation, making it difficult to detect races efficiently. Algorithms proposed by prior work to detect races on programs with futures all have to execute the program sequentially. We propose F-Order, the first known parallel race detection algorithm that detects races on programs that use futures. Given a computation with work T1 and span T, our algorithm detects races in time O((T1 lg k + k2)/P + T(k + lg r lg k)) processors, where k is the number of future operations, r is the maximum number of readers per memory location, and k is the maximum number of future operations done by a single future task, which is typically small. We have also implemented a prototype system based on the proposed algorithm and empirically demonstrates its practical efficiency and scalability.

Practical parallel hypergraph algorithms

While there has been significant work on parallel graph processing, there has been very surprisingly little work on high-performance hypergraph processing. This paper presents a collection of efficient parallel algorithms for hypergraph processing, including algorithms for betweenness centrality, maximal independent set, k-core decomposition, hypertrees, hyperpaths, connected components, PageRank, and single-source shortest paths. For these problems, we either provide new parallel algorithms or more efficient implementations than prior work. Furthermore, our algorithms are theoretically-efficient in terms of work and depth. To implement our algorithms, we extend the Ligra graph processing framework to support hypergraphs, and our implementations benefit from graph optimizations including switching between sparse and dense traversals based on the frontier size, edge-aware parallelization, using buckets to prioritize processing of vertices, and compression. Our experiments on a 72-core machine and show that our algorithms obtain excellent parallel speedups, and are significantly faster than algorithms in existing hypergraph processing frameworks.

A supernodal all-pairs shortest path algorithm

We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (Apsp) problem. Floyd-Warshall is an attractive choice for Apsp on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if sparsity of the input graph is not properly exploited, Floyd-Warshall will perform unnecessary asymptotic work and thus may not be a suitable choice for many input graphs. To overcome this limitation, the key idea in our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import several algorithmic techniques from sparse Cholesky factorization, namely, fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism. When combined, these techniques reduce computation, improve locality and enhance parallelism. We implement these ideas in an efficient shared memory parallel prototype that is orders of magnitude faster than an efficient multi-threaded baseline Floyd-Warshall that does not exploit sparsity. Our experiments suggest that the Floyd-Warshall algorithm can compete with Dijkstra's algorithm (the algorithmic core of Johnson's algorithm) for several classes sparse graphs.

Increasing the parallelism of graph coloring via shortcutting

Graph coloring is an assignment of colors to the vertices of a graph such that no two adjacent vertices get the same color. It is a key building block in many applications. Finding a coloring with a minimal number of colors is often only part of the problem. In addition, the solution also needs to be computed quickly. Several parallel implementations exist, but they may suffer from low parallelism depending on the input graph. We present an approach that increases the parallelism without affecting the coloring quality. On 18 test graphs, our technique yields an average of 3.4 times more parallelism. Our CUDA implementation running on a Titan V is 2.9 times faster on average and uses as few or fewer colors as the best GPU codes from the literature.

Non-blocking interpolation search trees with doubly-logarithmic running time

Balanced search trees typically use key comparisons to guide their operations, and achieve logarithmic running time. By relying on numerical properties of the keys, interpolation search achieves lower search complexity and better performance. Although interpolation-based data structures were investigated in the past, their non-blocking concurrent variants have received very little attention so far.

In this paper, we propose the first non-blocking implementation of the classic interpolation search tree (IST) data structure. For arbitrary key distributions, the data structure ensures worst-case O(log n + p) amortized time for search, insertion and deletion traversals. When the input key distributions are smooth, lookups run in expected O(log log n + p) time, and insertion and deletion run in expected amortized O(log log n + p) time, where p is a bound on the number of threads. To improve the scalability of concurrent insertion and deletion, we propose a novel parallel rebuilding technique, which should be of independent interest.

We evaluate whether the theoretical improvements translate to practice by implementing the concurrent interpolation search tree, and benchmarking it on uniform and nonuniform key distributions, for dataset sizes in the millions to billions of keys. Relative to the state-of-the-art concurrent data structures, the concurrent interpolation search tree achieves performance improvements of up to 15% under high update rates, and of up to 50% under moderate update rates. Further, ISTs exhibit up to 2X less cache-misses, and consume 1.2 -- 2.6X less memory compared to the next best alternative on typical dataset sizes. We find that the results are surprisingly robust to distributional skew, which suggests that our data structure can be a promising alternative to classic concurrent search structures.

YewPar: skeletons for exact combinatorial search

Combinatorial search is central to many applications, yet the huge irregular search trees and the need to respect search heuristics make it hard to parallelise. We aim to improve the reuse of intricate parallel search implementations by providing the first general purpose scalable parallel framework for exact combinatorial search, YewPar.

We make the following contributions. (1) We present a novel formal model of parallel backtracking search, covering enumeration, decision, and optimisation search. (2) We introduce Lazy Node Generators as a uniform API for search tree generation. (3) We present the design and implementation of 12 widely applicable algorithmic skeletons for tree search on shared and distributed memory architectures. (4) Uniquely in the field we demonstrate how a wide range of parallel search applications can easily be constructed by composing Lazy Node Generators and the search skeletons. (5) We report a systematic performance analysis of all 12 YewPar skeletons on standard instances of 7 search applications, investigating skeleton overheads and scalability up to 255 workers on 17 distributed locations.

XIndex: a scalable learned index for multicore data storage

We present XIndex, a concurrent ordered index designed for fast queries. Similar to a recent proposal of the learned index, XIndex uses learned models to optimize index efficiency. Comparing with the learned index, XIndex is able to effectively handle concurrent writes without affecting the query performance by leveraging fine-grained synchronization and a new compaction scheme, Two-Phase Compaction. Furthermore, XIndex adapts its structure according to run-time workload characteristics to support dynamic workload. We demonstrate the advantages of XIndex with both YCSB and TPC-C (KV), a TPC-C variant for key-value stores. XIndex achieves up to 3.2X and 4.4X performance improvement comparing with Masstree and Wormhole, respectively, on a 24-core machine, and it is open-sourced1.

Overlapping host-to-device copy and computation using hidden unified memory

In this paper, we propose a runtime, called HUM, which hides host-to-device memory copy time without any code modification. It overlaps the host-to-device memory copy with host computation or CUDA kernel computation by exploiting Unified Memory and fault mechanisms. HUM provides wrapper functions of CUDA commands and executes host-to-device memory copy commands in an asynchronous manner. We also propose two runtime techniques. One checks if it is correct to make the synchronous host-to-device memory copy command asynchronous. If not, HUM makes the host computation or the kernel computation wait until the memory copy completes. The other subdivides consecutive host-to-device memory copy commands into smaller memory copy requests and schedules the requests from different commands in a round-robin manner. As a result, the kernel execution can be scheduled as early as possible to maximize the overlap. We evaluate HUM using 51 applications from Parboil, Rodinia, and CUDA Code Samples and compare their performance under HUM with that of hand-optimized implementations. The evaluation result shows that executing the applications under HUM is, on average, 1.21 times faster than executing them under original CUDA. The speedup is comparable to the average speedup 1.22 of the hand-optimized implementations for Unified Memory.

<u>G</u>PU <u>i</u>nitiated <u>O</u>penSHMEM: correct and efficient intra-kernel networking for dGPUs

Current state-of-the-art in GPU networking utilizes a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored performing network operations from within a GPU kernel itself. However, these approaches typically involve the CPU in the critical path, which leads to high latency and inefficient utilization of network and/or GPU resources.

In this work, we introduce GPU Initiated OpenSHMEM (GIO), a new intra-kernel PGAS programming model and runtime that enables GPUs to communicate directly with a NIC without the intervention of the CPU. We accomplish this by exploring the GPU's coarse-grained memory model and correcting semantic mismatches when GPUs wish to directly interact with the network. GIO also reduces latency by relying on a novel template-based design to minimize the overhead of initiating a network operation. We illustrate that for structured applications like a Jacobi 2D stencil, GIO can improve application performance by up to 40% compared to traditional kernel-boundary networking. Furthermore, we demonstrate that on irregular applications like Sparse Triangular Solve (SpTS), GIO provides up to 44% improvement compared to existing intra-kernel networking schemes.

No barrier in the road: a comprehensive study and optimization of ARM barriers

In this paper, we present the first comprehensive performance characterization and optimization of ARM barriers on both mobile and server platforms. We draw a set of observations through several abstracted models and validate them in scenarios where barriers are intensively used. We find that (1) order-preserving approaches without involving the bus significantly outperform other approaches, and (2) the tremendous overhead mostly comes from barriers strictly following remote memory references. Usually, such barriers are inserted when threads are exchanging data, and they are used to ensure the relative order between storing the data to a shared buffer and setting a flag to inform the receiver. Based on the observations, we propose a new mechanism, Pilot, to remove such barriers by leveraging the single-copy atomicity to piggyback the flag with the data. Applying Pilot only requires minor changes to applications and provides 10%-360% performance improvements in multiple benchmarks, which are close to the ideal performance without barriers.

spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis

Sparse general matrix-matrix multiplication on GPUs is challenging due to the varying sparsity patterns of sparse matrices. Existing solutions achieve good performance for certain types of matrices, but fail to accelerate all kinds of matrices in the same manner. Our approach combines multiple strategies with dynamic parameter selection to dynamically choose and tune the best fitting algorithm for each row of the matrix. This choice is supported by a lightweight, multi-level matrix analysis, which carefully balances analysis cost and expected performance gains. Our evaluation on thousands of matrices with various characteristics shows that we outperform all currently available solutions in 79% over all matrices with >15k products and that we achieve the second best performance in 15%. For these matrices, our solution is on average 83% faster than the second best approach and up to 25X faster than other state-of-the-art GPU implementations. Using our approach, applications can expect great performance independent of the matrices they work on.

A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs

SpMM (multiplication of a sparse matrix and a dense matrix) and SDDMM (sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications. Because of the irregular memory accesses, the two kernels have poor data locality, and data movement overhead is a bottleneck for their performance. To overcome this issue, previous works have proposed using tiling and data reorganization to enhance data reuse. Despite their success in improving the performance for many sparse matrices, we find that the efficacy of existing techniques largely depends on how the non-zeros are distributed in a sparse matrix. In this work, we propose a novel row-reordering technique to improve data locality for SpMM and SDDMM on GPUs. The goal of such row reordering is to place similar rows close to each other, allowing them to be processed together, and thus providing better temporal locality for the values of the dense matrix. We focus on performing the row-reordering efficiently, by using a hierarchical clustering procedure optimized by locality-sensitive hashing. We also investigate when row-reordering is useful, and what factors the performance gains from our method are correlated to. Experimental evaluation using 1084 sparse matrices from SuiteSparse collection and Network Repository shows that our technique achieves up to 2.91x speedup for SpMM and up to 3.19x speedup for SDDMM against the state-of-the-art alternatives on an Nvidia P100 GPU.

MatRox: modular approach for improving data locality in hierarchical (Mat)rix App(Rox)imation

Hierarchical matrix approximations have gained significant traction in the machine learning and scientific community as they exploit available low-rank structures in kernel methods to compress the kernel matrix. The resulting compressed matrix, HMatrix, is used to reduce the computational complexity of operations such as HMatrix-matrix multiplications with tuneable accuracy in an evaluation phase. Existing implementations of HMatrix evaluations do not preserve locality and often lead to unbalanced parallel execution with high synchronization. Also, current solutions require the compression phase to re-execute if the kernel method or the required accuracy change. MatRox is a framework that uses novel structure analysis strategies with code specialization and a storage format to improve locality and create load-balanced parallel tasks for HMatrix-matrix multiplications. Modularization of the matrix compression phase enables the reuse of computations when there are changes to the input accuracy and the kernel function. The MatRox-generated code for matrix-matrix multiplication is 2.98X, 1.60X, and 5.98X faster than library implementations available in GOFMM, SMASH, and STRUMPACK respectively. Additionally, the ability to reuse portions of the compression computation for changes to the accuracy leads to up to 2.64X improvement with MatRox over five changes to accuracy using GOFMM.

A parallel sparse tensor benchmark suite on CPUs and GPUs

Tensor computations present significant performance challenges that impact a wide spectrum of applications. Efforts on improving the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO). It demonstrates a set of reference tensor kernel implementations and some observations on Intel CPUs and NVIDIA GPUs. The full paper can be referred to at http://arxiv.org/abs/2001.00660.

Nesting and composition in transactional data structure libraries

Transactional data structure libraries (TDSL) combine the ease-of-programming of transactions with the high performance and scalability of custom-tailored concurrent data structures. They can be very efficient thanks to their ability to exploit data structure semantics in order to reduce overhead, aborts, and wasted work compared to general-purpose software transactional memory. However, TDSLs were not previously used for complex use-cases involving long transactions and a variety of data structures.

In this work, we boost the performance and usability of a TDSL, allowing it to support complex applications. A key idea is nesting. Nested transactions create checkpoints within a longer transaction, so as to limit the scope of abort, without changing the semantics of the original transaction. We build a Java TDSL with built-in support for nesting in a number of data structures. We conduct a case study of a complex network intrusion detection system that invests a significant amount of work to process each packet. Our study shows that our library outperforms TL2 twofold without nesting, and by up to 16x when nesting is used. Finally, we discuss cross-library nesting, namely dynamic composition of transactions from multiple libraries.

ELDA: LDA made efficient via algorithm-system codesign submission

Latent Dirichlet Allocation (LDA) is a statistical approach for topic modeling with a wide range of applications. In spite of the significance, we observe very few attempts from system track to improve LDA, let alone the algorithm and system codesigned efforts. To this end, we propose eLDA with an algorithm-system codesigned optimization. Particularly, we introduce a novel three-branch sampling mechanism to taking advantage of the convergence heterogeneity of various tokens in order to reduce redundant sampling task. Our evaluation shows that eLDA outperforms the state-of-the-arts.

Identifying scalability bottlenecks for large-scale parallel programs with graph analysis

Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, code serialization, and resource contention. Performance analysis tools for finding such scaling bottlenecks either base on profiling or tracing. Profiling incurs lower overheads but does not capture detailed dependencies needed for root-cause analyses. Tracing collects all information at prohibitive overheads. In this work, we develop ScalAna that uses static analysis techniques to achieve the best of both worlds---it enables the analyzability of traces at a cost similar to profiling. We leverage compiler and runtime lightweight techniques to generate performance graph and perform graph analysis algorithm to detect the root cause of scaling issues. We evaluate ScalAna with real applications on the Tianhe-2 supercomputer. Results show that our approach can effectively locate the root cause of scalability bottlenecks for real applications and incur less than 6.38% overhead (1.89% on average) for up to 2,048 processes.

Revisiting linpack algorithm on large-scale CPU-GPU heterogeneous systems

As the widening gap between GPU computing capability and other components (CPU, PCIe bus and communication network), it's increasingly challenging to design high performance parallel algorithms for large CPU-GPU heterogeneous systems. There are mainly two reasons. Firstly, simply offloading the kernel library to GPU incurs large volume data transfer through low-speed PCIe bus. Secondly, communication overheads through network severely affects scalability. To solve the above issues, we advocate a paradigm shift to CPU-centric and fine-grained pipelining algorithm design. By taking Linpack benchmark as a case study, the new algorithm design paradigm shows its effectiveness. Our optimized Linpack program achieves 63.79PFlops on 16384 GPUs. Its floating-point efficiency outperforms the NVIDIA proprietary counterparts by 5% on average.

Neighbor-list-free molecular dynamics on sunway TaihuLight supercomputer

Molecular dynamics (MD) simulations are playing an increasingly important role in many research areas. Pair-wise potentials are widely used in MD simulations of bio-molecules, polymers, and nano-scale materials. Due to a low compute-to-memory-access ratio, their calculation is often bounded by memory transfer speeds. Sunway TaihuLight is one of the fastest supercomputers featuring a custom SW26010 many-core processor. Since the SW26010 has some critical limitations regarding main memory bandwidth and scratchpad memory size, it is considered as a good platform to investigate the optimization of pair-wise potentials especially in terms of data reusage. MD algorithms often use a neighbor-list data structure to reduce the computational workload. In this paper, we show that a cell-list-based approach is more suitable for the SW26010 processor. We apply a number of novel optimization methods including self-adaptable replica-summation for conflict-free parallelization, parameter profiles for flexible vectorization, and particle-cell cutoff checking filters for reducing the computational workload. We also established an open source standalone framework featuring the techniques above, ESMD1, which is at least 50% faster than the latest existing LAMMPS port on a single TaihuLight node. Furthermore, EMSD achieves a weak scaling efficiency of 88% on 4,096 nodes.

A tool for top-down performance analysis of GPU-accelerated applications

To support performance measurement and analysis of GPU-accelerated applications, we extended the HPCToolkit performance tools with several novel features. To support efficient monitoring of accelerated applications, HPCToolkit employs a new wait-free data structure to coordinate measurement and attribution between each application thread and a GPU monitor thread. To help developers understand the performance of accelerated applications, HPCToolkit attributes metrics to heterogeneous calling contexts that span both CPUs and GPUs. To support fine-grain analysis and tuning of GPU-accelerated code, HPCToolkit collects PC samples of both CPU and GPU activity to derive and attribute metrics at all levels in a heterogeneous calling context.

Functional faults

Hardware and software faults increasingly surface in today's computing environment and vast theoretical and practical research efforts are being devoted to overcoming the effects of malfunctionality in the computing process. Most research to date, however, has focused on how to discover and handle faulty data. We initiate a formal study of faulty functionality in a modern multicore shared-memory environment. We introduce a model of functional faults, and study avenues that allow tolerating functional faults while maintaining the correctness of the entire computation. We demonstrate the generality of this model by constructing a robust consensus protocol from functionally-faulty compare-and-swap objects. Additionally, We formally prove (tight) impossibility result for the same constructions.

Breaking master-slave model between host and FPGAs

This paper proposes to enhance current task-based programming models by breaking their current master-slave approach between the main processor and its hardware accelerators. As a proof-of-concept, it presents an extension of the OmpSs@FPGA toolchain that allows the tasks offloaded into the FPGA to create and synchronize nested tasks on their own without involving the host. Those FPGA spawned tasks may target the host to execute code not suitable for the FPGA, like system calls or I/O operations; or target other kernel accelerators inside the same FPGA. In addition to the programmability benefits of this new feature, the proposed system presents significant performance improvements and a better productivity over the classical master-slave approach.

Understanding and optimizing persistent memory allocation

The proliferation of fast, dense, byte-addressable nonvolatile memory suggests the possibility of keeping data in pointer-rich "in-memory" format across program runs and even crashes. For full generality, such data requires dynamic memory allocation. Toward this end, we introduce recoverability, a correctness criterion for persistent allocators, together with a nonblocking allocator, Ralloc, that satisfies this criterion. Ralloc is based on LRMalloc [8], with three key innovations. First, we persist just enough information during normal operation to permit reconstruction of the heap after a full-system crash. Our reconstruction mechanism performs garbage collection (GC) to identify and remedy any failure-induced memory leaks. Second, in support of GC, we introduce the notion of filter functions, which identify the locations of pointers within persistent blocks. Third, to allow persistent regions to be mapped at an arbitrary address, we employ the position-independent pointer representation of Chen et al. [4], both in data and in allocator metadata.

Experiments show that Ralloc provides scalable performance competitive to that of both Makalu [2], the state-of-the-art lock-based persistent allocator, and the best transient allocators (e.g., JEMalloc [5]).

Testing concurrency on the JVM with lincheck

Concurrent programming can be notoriously complex and error-prone. Programming bugs can arise from a variety of sources, such as operation re-reordering, or incomplete understanding of the memory model. A variety of formal and model checking methods have been developed to address this fundamental difficulty. While technically interesting, existing academic methods are still hard to apply to the large codebases typical of industrial deployments, which limits their practical impact.

ArcherGear: data race equivalencing for expeditious HPC debugging

There is growing uptake of shared memory parallelism in high performance computing, and this has increased the need for data race checking during the creation of new parallel codes or parallelizing existing sequential codes. While race checking concepts and implementations have been around for many concurrency models, including tasking models such as Cilk and PThreads (e.g., the Thread Sanitizer tool), practically usable race checkers for other APIs such as OpenMP have been lagging. For example, the OpenMP parallelization of an important library (namely Hypre) was initially unsuccessful due to inexplicable nondeterminism introduced when the code was optimized, and later root-caused to a race by the then recently developed OpenMP race checker Archer [2]. The open-source Archer now enjoys significant traction within several organizations.

Reflector: a fine-grained I/O tracker for HPC systems

We present Reflector, to support both high-level and low-level I/O monitoring through user-defined interfaces such as HDF5 and NetCDF in addition to POSIX- and MPI-IO. We evaluate Reflector on both an on-premises 500-core HPC cluster and a leadership-class supercomputer at the Lawrence Berkeley National Laboratory. Preliminary results are promising as the system prototype incurs negligible performance overhead and clearly illustrates the I/O patterns and bottlenecks of multiple applications.

Nonblocking persistent software transactional memory

While developed largely for higher density and lower power, byte-addressable nonvolatile memory can also allow data to persist across program runs and system crashes without the need to flush to disk or flash. If data is to be recovered after a crash, however, care must be taken to ensure that the contents of memory are consistent at all times. This can be challenging in multithreaded applications with write-back caches. We present QSTM, a persistent word-based software transactional memory (STM) system to address this problem. Unlike past such systems, QSTM is nonblocking and does not require either the modification of target data structures or the use of a wide CAS instruction.

Optimizing GPU programs by partial evaluation

While GPU utilization allows one to speed up computations to the orders of magnitude, memory management remains the bottleneck making it often a challenge to achieve the desired performance. Hence, different memory optimizations are leveraged to make memory being used more effectively. We propose an approach automating memory management utilizing partial evaluation, a program transformation technique that enables data accesses to be pre-computed, optimized, and embedded into the code, saving memory transactions. An empirical evaluation of our approach shows that the transformed program could be up to 8 times as efficient as the original one in the case of CUDA C naïve string pattern matching algorithm implementation.

Restricted memory-friendly lock-free bounded queues

Multi-producer multi-consumer FIFO queue is one of the fundamental concurrent data structures used in software systems. A lot of progress has been done on designing concurrent bounded and unbounded queues [1--10]. As previous works show, it is extremely hard to come up with an efficient algorithm. There are two orthogonal ways to improve the performance of fair concurrent queues: reducing the number of compare-and-swap (CAS) calls, and making queues more memory-friendly by reducing the number of allocations. The most up-to-date efficient algorithms choose the first path and use more scalable fetch-and-add (FAA) instead of CAS [3, 4, 10]. For the second path, the standard way to design memory-friendly versions is to implement queues on top of arrays [2--4, 10]. For unbounded queues it is reasonable to allocate memory in chunks, constructing a linked queue on them; this approach significantly improves the performance. The bounded queues are more memory-friendly by design: they are represented as a fixed-sized array of elements even in theory. However, most of the bounded queue implementations still have issues with memory allocations --- typically, they either use descriptors [5, 8] or store some additional meta-information along with the elements [1, 6, 7, 9].

Understand the overheads of storage data structures on persistent memory

The byte-addressable persistent memory (PMEM) devices have opened new opportunities for building high-performance storage systems. With both DRAM and PMEM in the system, it is important to choose the correct storage data structures on each of them to achieve the best overall performance and the needed data persistence. However, this is non-trivial. One reason is the limited understanding of the actual performance characteristic of different data structures on PMEM. In this study, we develop storeds-bench to help developers better understand the overhead they will encounter when using a certain data structure on PMEM. Specifically, storeds-bench is designed as a benchmark suite that leverages YCSB and has various commonly used storage data structures implemented using PMDK (persistent memory develop kit) under different persistent and consistency requirements.

PLUM: static parallel program locality analysis under uniform multiplexing

Data movement has a significant impact on program performance. For multithread programs, this impact is amplified, since different threads often interfere with each other by competing for shared cache space. However, recent de facto locality metrics consider either sequential execution only, or derive locality for multithread programs in an inefficient way, i.e. exhaustive simulation.

This paper presents PLUM, a compiler solution for time-scale locality analysis for parallel programs. Experiments demonstrate that the prediction accuracy is 93.97% on average. PLUM is the first tool that analyzes data locality for parallel programs during compile time; in addition, it provides an approach for efficiently studying the representative interleaving pattern for parallel executions.