Tiling matrix operations can improve the load balancing and performance of applications on heterogeneous computing resources. However, writing a tile-based algorithm for each operation with a traditional, hand-tuned tiling approach that uses for loops in C/C++ is cumbersome and error-prone. Moreover, such an approach must support heterogeneous memory management of data objects and should exploit architecture-supported, native tiled-data transfer APIs instead of copying tiled data to contiguous memory before the transfer. Our tiling framework provides a tiled data structure for heterogeneous memory mapping and parameterization of a heterogeneous task-specification API. We have integrated the tiling framework into MatRIS (Math kernels library using IRIS). IRIS is a heterogeneous runtime framework with a heterogeneous programming model, memory model, and task execution model. Experiments show that the tiling framework improves the programmability of tiled BLAS operations and delivers roughly 20% higher performance than the traditional method, which copies data to contiguous memory locations before heterogeneous data transfer.
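To make the idea concrete, the following is a minimal sketch of a tiled matrix layout in which each tile owns its own buffer, so a runtime could map and transfer tiles independently instead of first packing them into one contiguous array. This is not the MatRIS/IRIS API; the class, function, and tile sizes are illustrative assumptions only.

```python
# Illustrative sketch: per-tile buffers enable independent memory mapping
# and transfer of each tile. NOT the MatRIS/IRIS API; names are hypothetical.
import numpy as np

class TiledMatrix:
    def __init__(self, matrix, tile_size):
        self.tile_size = tile_size
        self.rows = (matrix.shape[0] + tile_size - 1) // tile_size
        self.cols = (matrix.shape[1] + tile_size - 1) // tile_size
        # Each tile is a separate contiguous buffer; a heterogeneous runtime
        # could register each one as an independently transferable data object.
        self.tiles = {
            (i, j): np.ascontiguousarray(
                matrix[i * tile_size:(i + 1) * tile_size,
                       j * tile_size:(j + 1) * tile_size])
            for i in range(self.rows) for j in range(self.cols)
        }

def tiled_gemm(A, B, C):
    """C += A @ B expressed as one update per output tile (naive sketch)."""
    for i in range(C.rows):
        for j in range(C.cols):
            for k in range(A.cols):
                # In a heterogeneous runtime, this update would be submitted
                # as a task whose inputs are the three tiles referenced here.
                C.tiles[(i, j)] += A.tiles[(i, k)] @ B.tiles[(k, j)]

# Usage: tile two 512x512 matrices into 128x128 tiles and multiply.
n, ts = 512, 128
A = TiledMatrix(np.random.rand(n, n), ts)
B = TiledMatrix(np.random.rand(n, n), ts)
C = TiledMatrix(np.zeros((n, n)), ts)
tiled_gemm(A, B, C)
```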
Tuning tensor program generation involves searching over combinations of program transformations for a given program on target hardware to optimize tensor program execution. The massive search space already makes this a complex process, and the exponential number of transformation combinations makes auto-tuning tensor program generation even more challenging, especially for heterogeneous targets. In this research, we address these problems by learning joint neural-network and hardware features and transferring them to new target hardware. We extensively study the existing state-of-the-art dataset, TenSet, perform a comparative analysis of test-split strategies, and propose methodologies for pruning the dataset. We adopt an attention-inspired approach for tuning tensor programs that embeds neural-network- and hardware-specific features. Our approach can prune the dataset to as little as 45% of the baseline without compromising Pairwise Comparison Accuracy (PCA). Further, the proposed methodology achieves on-par or improved mean inference time with 25%-40% of the baseline tuning time across different networks and target hardware.
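For readers unfamiliar with the PCA metric, it measures how often a learned cost model orders pairs of candidate programs the same way the measured latencies do. Below is a minimal, simplified sketch of the metric, not the TenSet/TVM reference implementation; the candidate costs shown are made-up example values.

```python
# Minimal sketch of Pairwise Comparison Accuracy (PCA): the fraction of
# candidate-program pairs whose predicted ordering matches the measured
# ordering. Simplified; not the TenSet/TVM reference implementation.
from itertools import combinations

def pairwise_comparison_accuracy(predicted, measured):
    pairs = list(combinations(range(len(measured)), 2))
    correct = sum(
        (predicted[i] < predicted[j]) == (measured[i] < measured[j])
        for i, j in pairs
    )
    return correct / len(pairs)

# Example: predicted costs for four candidate schedules vs. measured latencies.
pred = [0.8, 0.5, 1.2, 0.9]
meas = [1.2, 0.7, 1.6, 1.3]
print(pairwise_comparison_accuracy(pred, meas))  # 1.0: all six pairs ordered consistently
```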
Specialized processors designed to accelerate tensor operations are evolving faster than conventional processors. This trend of architectural innovation greatly benefits artificial intelligence (AI) workloads. However, it is unknown how well AI-optimized accelerators can be retargeted to scientific applications. To answer this question, we explore (1) whether a typical scientific modeling kernel can be mapped efficiently to tensor operations and (2) whether this approach is portable across diverse processors and AI accelerators. In this paper, we implement two versions of tracer advection in an ocean-modeling application using PyTorch and evaluate them on one CPU, two GPUs, and Google's TPU. Our findings show that scientific modeling can gain both a performance boost and improved portability by mapping key computational kernels to tensor operations.
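As an illustration of what "mapping a stencil kernel to tensor operations" can look like, the sketch below implements a simple first-order upwind tracer-advection step on a periodic 2-D grid using only elementwise PyTorch operations and torch.roll, so the same code runs on CPU, GPU, or TPU backends. This is not the paper's kernel; the grid size, velocities, and time step are assumptions chosen for the example.

```python
# Illustrative sketch (not the paper's kernel): first-order upwind advection
# of a tracer on a periodic 2-D grid, written entirely with tensor operations
# so it is portable across devices.
import torch

def advect_upwind(tracer, u, v, dt, dx, dy):
    """One upwind step of dT/dt = -u dT/dx - v dT/dy with periodic boundaries."""
    # One-sided differences in x, selected by the sign of the local velocity.
    dTdx_minus = (tracer - torch.roll(tracer, shifts=1, dims=1)) / dx
    dTdx_plus = (torch.roll(tracer, shifts=-1, dims=1) - tracer) / dx
    dTdx = torch.where(u > 0, dTdx_minus, dTdx_plus)
    # Same construction in y.
    dTdy_minus = (tracer - torch.roll(tracer, shifts=1, dims=0)) / dy
    dTdy_plus = (torch.roll(tracer, shifts=-1, dims=0) - tracer) / dy
    dTdy = torch.where(v > 0, dTdy_minus, dTdy_plus)
    return tracer - dt * (u * dTdx + v * dTdy)

# Usage on whatever device is available; the kernel itself is device-agnostic.
device = "cuda" if torch.cuda.is_available() else "cpu"
ny, nx = 256, 256
tracer = torch.rand(ny, nx, device=device)
u = torch.full((ny, nx), 0.5, device=device)   # constant eastward velocity
v = torch.full((ny, nx), -0.2, device=device)  # constant southward velocity
for _ in range(100):
    tracer = advect_upwind(tracer, u, v, dt=0.1, dx=1.0, dy=1.0)
```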