We present a type system for expressing size constraints on array types in an ML-style type system. The goal is to detect shape mismatches at compile-time, while being simpler than full dependent types. The main restrictions is that the only terms that can occur in types are array sizes, and syntactically they must be variables or constants. For those programs where this is not sufficient, we support a form of existential types, with the type system automatically managing the requisite book-keeping. We formalise a large subset of the type system in a small core language, which we prove sound. We also present an integration of the type system in the high-performance parallel functional language Futhark, and show on a collection of 44 representative programs that the restrictions in the type system are not too problematic in practice.
Multi-dimensional array manipulation constitutes a core component of numerous numerical methods, e.g. finite difference solvers of Partial Differential Equations (PDEs). The efficiency of such computations is tightly connected to traversing array data in a hardware-friendly way.
The Mathematics of Arrays (MoA) allows reasoning about array computations at a high level and enables systematic transformations of array-based programs. We have previously shown that stencil computations reduce to MoA's Denotational Normal Form (DNF).
Here we bring to light MoA's Operational Normal Forms (ONFs) that allow for adapting array computations to hardware characteristics. ONF transformations start from the DNF. Alongside the ONF transformations, we extend MoA with rewriting rules for padding. These new rules allow both a simplification of array indexing and a systematic approach to introducing halos to PDE solvers. Experiments on various architectures confirm the flexibility of the approach.
This paper reports on the acceleration of a standard, lattice-based numerical algorithm that is widely used in finance for pricing a class of fixed-income vanilla derivatives.
We start with a high-level algorithmic specification, exhibiting irregular nested parallelism, which is challenging to map efficiently to GPU hardware. From it we systematically derive and optimize two CUDA implementations, which utilize only the outer or all levels of parallelism, respectively. A detailed evaluation demonstrates (i) the high impact of the proposed optimizations, (ii) the complementary strength and weaknesses of the two GPU versions, and that (iii) they are on average 2.4× faster than our well-tuned CPU-parallel implementation (OpenMP+AVX2) running on 104 hardware threads, and by 3-to-4 order of magnitude faster than an OpenMP-parallel implementation using the popular QuantLib library.
Most implementations of machine learning algorithms are based on special-purpose frameworks such as TensorFlow or PyTorch. While these frameworks are convenient to use, they introduce multi-million lines of code dependency that one has to trust, understand and potentially modify. As an alternative, this paper investigates a direct implementation of a state of the art Convolutional Neural Network (CNN) in an array language. While our implementation requires 150 lines of code to define the special-purpose operators needed for CNNs, which are readily provided through frameworks such as TensorFlow and PyTorch, our implementation outperforms these frameworks by factors 2 and 3 on a fixed set of hardware — a 64-core GPU-accelerated machine; for a simple example network. The resulting specification is written in a rank-polymorphic data-parallel style, and it can be immediately leveraged by optimising compilers. Indeed, array languages make neural networks fast.