This paper compares OpenCL, CUDA, and HIP as compilation targets for Futhark, a functional array language. We compare the performance of OpenCL versus CUDA, and OpenCL versus HIP, on the code generated by the Futhark compiler on a collection of 48 application benchmarks on two different GPUs. Despite the generated code in most cases being equivalent, we observe significant performance differences on the same hardware, ranging from 0.42x to 1.72x in the most extreme cases. We identify the root causes of most of these differences, many of which are due to relatively superficial details such as inconsistent defaults regarding compiler optimisation and numerical accuracy, although a few remain mysterious.
We present an Integer Linear Programming based approach to finding the optimal fusion strategy for combinator-based parallel programs. While combinator-based languages or libraries provide a convenient interface for programming parallel hardware, fusing combinators to more complex operations is essential to achieve the desired performance. Our approach is not only suitable for languages with the usual map, fold, scan, indexing and scatter operations, but also gather operations, which access arrays in arbitrary order, and therefore goes beyond the traditional producer-consumer fusion. It can be parametrised with appropriate cost functions, and is fast enough to be suitable for just-in-time compilation.