Many numerical algorithms on matrices or tensors can be formulated in a blocking style which improves performance due to better cache locality. In imperative languages, blocking is achieved by introducing additional layers of loops in a nested fashion alongside with suitable adjustments in index computations. While this process is tedious and error-prone, it is also difficult to implement a generically blocked version that would support arbitrary levels of blocking.
At the example of matrix multiply, this paper demonstrates how rank-polymorphic array languages enable the specification of such generically blocked algorithms in a simple, recursive form. The depth of the blocking as well as blocking factors can be encoded in the structure of array shapes. In turn, reshaping arrays makes it possible to switch between blocked and non-blocked arrays. Through rank-polymorphic array combinators, any specification of loop boundaries or explicit index computations can be avoided.
Firstly, we propose a dependently-typed framework for rank-polymorphic arrays. We use it to demonstrate that all blocked algorithms can be naturally derived by induction on the argument shapes. Our framework guarantees lack of out-of-bound indexing, and we also prove that all the blocked versions compute the same results as the canonical algorithm. Secondly, we translate our specification to the array language SaC. Not only do we show that we achieve similar conciseness in the implementation, but we also observe good performance of the generated code. We achieve a 7% improvement compared to the highly-optimised OpenBLAS library, and 3% compared to Intel’s MKL library when running on a 32-core shared-memory system.
Optimal usage of the memory system is a key element of fast GPU algorithms. Unfortunately many common algorithms fail in this regard despite exhibiting great regularity in memory access patterns. In this paper we propose efficient kernels to permute the elements of an array. We handle a class of permutations known as Bit Matrix Multiply Complement (BMMC) permutations, for which we design kernels of speed comparable to that of a simple array copy. This is a first step towards implementing a set of array combinators based on these permutations.
We present a dependent type system for enforcing array-size consistency in an ML-style functional array language. Our goal is to enforce shape-consistency at compile time and allow nontrivial transformations on array shapes, without the complexity such features tend to introduce in dependently typed languages. Sizes can be arbitrary expressions and size equality is purely syntactical, which fits naturally within a scheme that interprets size-polymorphic functions as having implicit arguments. When non-syntactical equalities are needed, we provide dynamic checking. In contrast to other dependently typed languages, we automate the book-keeping involved in tracking existential sizes, such as when filtering arrays. We formalise a large subset of the presented type system and prove it sound. We also discuss how to adapt the type system for a real implementation, including type inference, within the Futhark programming language.