Migrating CPU code to the CUDA programming language has long been a challenge. While many high-performance, massively data-parallel applications have been successfully ported to GPUs, comparatively little attention has been paid to applications that do not lend themselves so well to the characteristics of GPUs. Among the latter are applications of particular value to industry, where the use of GPUs can significantly improve the productivity of the system as a whole.
This article presents a real-world, industrial use case covering (part of) a complex computer vision workflow. The processes in the workflow have low arithmetic intensity and operate on simple data (integers). The main challenge is therefore to minimise the overhead of data transfers between the host and the GPU, or even within the memory of the device itself. While the speed-up achieved may not be as impressive as for applications that are perfectly suited to the GPU architecture, the algorithms and data distribution proposed in this work allow a significant part of the overall workflow to be offloaded to the GPU, freeing the CPU to process other components of the workflow. As a result, this approach has a significant impact on the productivity and economic performance of the company.
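As a rough illustration of the kind of transfer hiding this calls for, the following sketch uses pinned host memory and multiple CUDA streams to overlap host-device copies with kernel execution; the chunking scheme and the process_chunk kernel are hypothetical placeholders, not taken from the article.

// Sketch: overlapping host<->device copies with kernel execution using
// pinned host memory and CUDA streams. N_CHUNKS, CHUNK and process_chunk
// are illustrative names only.
#include <cuda_runtime.h>

__global__ void process_chunk(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;  // placeholder for low-intensity integer work
}

int main() {
    const int N_CHUNKS = 4, CHUNK = 1 << 20;
    int *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  N_CHUNKS * CHUNK * sizeof(int));  // pinned memory is
    cudaMallocHost(&h_out, N_CHUNKS * CHUNK * sizeof(int));  // required for async copies
    cudaMalloc(&d_in,  N_CHUNKS * CHUNK * sizeof(int));
    cudaMalloc(&d_out, N_CHUNKS * CHUNK * sizeof(int));

    cudaStream_t streams[N_CHUNKS];
    for (int c = 0; c < N_CHUNKS; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < N_CHUNKS; ++c) {
        size_t off = (size_t)c * CHUNK;
        // Copy-in, compute and copy-out of different chunks overlap across streams.
        cudaMemcpyAsync(d_in + off, h_in + off, CHUNK * sizeof(int),
                        cudaMemcpyHostToDevice, streams[c]);
        process_chunk<<<(CHUNK + 255) / 256, 256, 0, streams[c]>>>(
            d_in + off, d_out + off, CHUNK);
        cudaMemcpyAsync(h_out + off, d_out + off, CHUNK * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();
    // Cleanup (cudaFree/cudaFreeHost/cudaStreamDestroy) omitted for brevity.
    return 0;
}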
We propose a framework that uses static resource analysis to guide the automatic optimization of general-purpose GPU (GPGPU) kernels written in CUDA, NVIDIA's framework for GPGPU programming. In our framework, candidate optimizations are applied to a kernel, and the resulting candidate kernels are evaluated by a static analysis that predicts their execution cost on the GPU. In contrast to many existing frameworks for performance-tuning GPU kernels, the use of static analysis lends itself to high-level, hardware-independent optimizations that can be of particular benefit to novice programmers unfamiliar with CUDA's performance pitfalls. As a proof of concept, we have implemented two example optimizations and a simple search strategy in a tool called COpPER (CUDA Optimization through Programmatic Estimation of Resources), which builds on a static resource analysis tool for CUDA from prior work. The prototype automatically improves the performance of sample kernels by 2-4% in initial experiments and demonstrates the feasibility of using static analysis as part of automated performance tuning for GPU kernels.
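As a generic illustration of the kind of high-level, source-level rewrite such a framework can search over, consider hoisting repeated global-memory accesses into registers; the kernel and the transformation below are hypothetical and are not one of COpPER's two implemented optimizations.

// Hypothetical example of a rewrite that a static-analysis-guided tuner might
// propose: an analysis that counts global-memory operations would rank the
// second kernel as cheaper. Neither kernel is taken from the COpPER paper.

// Before: y[i] is read from and written to global memory on every iteration.
__global__ void axpy_naive(const float* x, float* y, float a, int n, int reps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int r = 0; r < reps; ++r)
        y[i] = a * x[i] + y[i];
}

// After: the operands live in registers; global memory is touched only
// outside the loop.
__global__ void axpy_hoisted(const float* x, float* y, float a, int n, int reps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi  = x[i];
    float acc = y[i];
    for (int r = 0; r < reps; ++r)
        acc = a * xi + acc;
    y[i] = acc;
}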
Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, thus preventing their portability across compilers. This paper describes Muppet, a new approach that identifies program modifications, called mutations, aimed at improving program performance. Muppet's mutations help developers reason about performance defects and missed opportunities to improve performance at the source-code level. In contrast to compiler techniques that optimize code at the level of intermediate representations (IR), Muppet uses the idea of source-level mutation testing to relax correctness constraints and automatically discover optimization opportunities that are otherwise not feasible at the IR level. We demonstrate Muppet's concept in the OpenMP programming model. Muppet generates a list of OpenMP mutations that alter the program's parallelism in various ways, and can run a variety of optimization algorithms, such as Bayesian optimization and delta debugging, to find a subset of mutations which, when applied to the original program, yield the greatest speedup while maintaining program correctness. When Muppet is evaluated against a diverse set of benchmark programs and proxy applications, it finds speedup-inducing sets of mutations in 70% of the evaluated programs.
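The sketch below shows the flavour of such source-level OpenMP mutations on a hypothetical loop nest; the stencil code and the particular clause changes are illustrative and are not drawn from Muppet's evaluation.

// Illustrative OpenMP mutations of the kind a source-level tool could propose.

// Original parallelization: outer loop only, default (static) schedule.
void stencil(float* out, const float* in, int rows, int cols) {
    #pragma omp parallel for
    for (int i = 1; i < rows - 1; ++i)
        for (int j = 1; j < cols - 1; ++j)
            out[i * cols + j] = 0.25f * (in[(i - 1) * cols + j] + in[(i + 1) * cols + j]
                                       + in[i * cols + j - 1] + in[i * cols + j + 1]);
}

// Mutated variant: collapse the loop nest and switch to a dynamic schedule.
// Whether this is faster depends on the program and the machine, which is
// exactly what the search over mutations decides empirically.
void stencil_mutated(float* out, const float* in, int rows, int cols) {
    #pragma omp parallel for collapse(2) schedule(dynamic, 64)
    for (int i = 1; i < rows - 1; ++i)
        for (int j = 1; j < cols - 1; ++j)
            out[i * cols + j] = 0.25f * (in[(i - 1) * cols + j] + in[(i + 1) * cols + j]
                                       + in[i * cols + j - 1] + in[i * cols + j + 1]);
}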
Memory and power constraints limit the current landscape of high-performance computing. Hardware specializations in clusters lead to heterogeneity, Non-Uniform Memory Architecture (NUMA) effects, and accelerator offloading. These increase the complexity of developing and optimizing scientific software.
To ease these challenges for domain scientists, the code generator for a prototype of the Parallel Pattern Language (PPL) is implemented, enabling its evaluation. The proof of concept uses parallel patterns to define parallelism and applies static global optimizations automatically. Most notably, an assignment between tasklets and the provided heterogeneous cluster architecture is calculated at compile time; the new code generator then creates a source file combining shared-memory, distributed-memory, and accelerator offloading according to the generated mapping.
The prototype successfully optimizes and compiles most Rodinia benchmarks, and six of them already show significant speedups. The tool's limitations include dynamic algorithms that are challenging to analyze statically and overheads incurred during compile-time optimization.
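A hand-written sketch of the kind of hybrid code such a generator targets is shown below: MPI for distributed memory, OpenMP for shared memory, and OpenMP target offloading for accelerators. The block distribution and the simple tasklet are illustrative assumptions; actual PPL output is produced automatically from the pattern description and the computed mapping.

// Hand-written sketch of hybrid generated-style code; not actual PPL output.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 22;
    const int local_n = N / size;                     // static block distribution
    std::vector<float> x(local_n, 1.0f), y(local_n, 2.0f);
    float* xp = x.data();
    float* yp = y.data();

    // Tasklet mapped to this node's accelerator (falls back to the host
    // if no device is present).
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:local_n]) map(tofrom: yp[0:local_n])
    for (int i = 0; i < local_n; ++i)
        yp[i] = 2.0f * xp[i] + yp[i];

    // Shared-memory reduction on the host, then a distributed-memory reduction.
    float local_sum = 0.0f, global_sum = 0.0f;
    #pragma omp parallel for reduction(+: local_sum)
    for (int i = 0; i < local_n; ++i) local_sum += yp[i];
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}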
For many computational problems, memory throughput is the performance bottleneck. The problem becomes even more pronounced on parallel platforms, where the ratio between computing elements and memory bandwidth shifts towards computing. Software needs to be attuned to hardware features such as cache architectures or memory banks to reach a decent level of performance efficiency. This can be achieved by selecting the right memory layouts for data structures or by changing the order in which data structures are traversed. In this work, we present an abstraction for traversing a set of regular data structures (e.g., multidimensional arrays) that allows the design of traversal-agnostic algorithms. Such algorithms can be adjusted for particular memory layouts of the data structures, semi-automated parallelization, or autotuning without altering their internal code. The proposed solution is implemented as an extension of the Noarr library, which simplifies the layout-agnostic design of regular data structures. It is implemented entirely using C++ template meta-programming without any non-standard dependencies, so it is fully compatible with existing compilers, including CUDA NVCC. We evaluate the performance and expressiveness of our approach on the Polybench-C benchmarks.
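The following is a simplified, self-contained illustration of the layout-agnostic idea in plain C++ templates; it deliberately does not reproduce the Noarr API, and the RowMajor/ColMajor layouts and sum_matrix function are hypothetical names.

// The algorithm is written once against an abstract index mapping; swapping
// the layout changes the memory access pattern without touching the algorithm.
#include <cstddef>
#include <iostream>
#include <vector>

struct RowMajor {
    std::size_t rows, cols;
    std::size_t operator()(std::size_t i, std::size_t j) const { return i * cols + j; }
};

struct ColMajor {
    std::size_t rows, cols;
    std::size_t operator()(std::size_t i, std::size_t j) const { return j * rows + i; }
};

template <class Layout>
float sum_matrix(const std::vector<float>& data, Layout layout) {
    float s = 0.0f;
    for (std::size_t i = 0; i < layout.rows; ++i)
        for (std::size_t j = 0; j < layout.cols; ++j)
            s += data[layout(i, j)];            // the layout decides the memory order
    return s;
}

int main() {
    std::vector<float> a(6, 1.0f);
    // The same algorithm runs over row-major and column-major storage.
    std::cout << sum_matrix(a, RowMajor{2, 3}) << ' '
              << sum_matrix(a, ColMajor{2, 3}) << '\n';  // prints: 6 6
    return 0;
}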
We present three novel parallel scan algorithms for multi-core CPUs which do not need to fix the number of available cores at the start and have zero overhead compared to sequential scans when executed on a single core. These two properties contrast with most existing parallel scan algorithms, which are asymptotically optimal but incur a constant-factor overhead compared to sequential scans when executed on a single core. We achieve these properties by adapting the classic three-phase scan algorithms; the resulting algorithms also exhibit better performance than the original ones on multiple cores. Furthermore, we adapt the chained scan with decoupled look-back algorithm to also have these two properties. While this algorithm was originally designed for GPUs, we show it is also suitable for multi-core CPUs, outperforming the classic three-phase scans in our benchmarks by making better use of the processor's caches at the cost of more synchronisation. In general, our adaptive chained scan is the fastest parallel scan, but in specific situations our assisted reduce-then-scan is better.
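For reference, the sketch below shows the classic (non-adaptive) three-phase reduce-then-scan pattern, written with OpenMP for brevity; it is the baseline these algorithms improve upon, not the paper's adaptive or chained variants. Because every block is read twice, it does roughly twice the work of a plain sequential scan when run on a single core, which is exactly the constant-factor overhead discussed above.

// Classic three-phase (reduce-then-scan) inclusive scan; illustrative baseline only.
#include <omp.h>
#include <numeric>
#include <vector>

void inclusive_scan(std::vector<int>& a) {
    const int n = (int)a.size();
    const int p = omp_get_max_threads();
    std::vector<int> block_sum(p + 1, 0);

    #pragma omp parallel num_threads(p)
    {
        const int t  = omp_get_thread_num();
        const int lo = (int)((long long)n * t / p);
        const int hi = (int)((long long)n * (t + 1) / p);

        // Phase 1: each thread reduces its block.
        int s = 0;
        for (int i = lo; i < hi; ++i) s += a[i];
        block_sum[t + 1] = s;

        #pragma omp barrier
        // Phase 2: exclusive prefix of the block sums (done serially by one thread).
        #pragma omp single
        std::partial_sum(block_sum.begin(), block_sum.end(), block_sum.begin());

        // Phase 3: each thread scans its block, seeded with its block's prefix.
        int run = block_sum[t];
        for (int i = lo; i < hi; ++i) { run += a[i]; a[i] = run; }
    }
}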