We take advantage of the new tasking features in OpenMP to propose advanced task-parallel algorithms for the inversion of dense matrices via Gauss-Jordan elimination. Our algorithms partition the matrix operand into two levels of tasks: the matrix is first divided vertically, into column blocks (or panels), to accommodate the standard partial pivoting scheme that ensures the numerical stability of the method. In addition, depending on the particular kernel to be applied, each panel is partitioned either horizontally, into row blocks (tiles), or vertically, into μ-panels of columns, in order to extract sufficient task parallelism to feed a many-threaded general-purpose processor (CPU).
The results of the experimental evaluation show the performance benefits of the advanced tasking algorithms on an Intel Xeon Gold processor with 20 cores.
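The abstract does not include code, but the two-level decomposition it describes maps naturally onto OpenMP task dependences. The following is a minimal structural sketch, not the authors' implementation: the kernel bodies (gje_panel_factor, gje_tile_update), the sentinel arrays (pdep, tdep), and the panel/tile sizes are illustrative assumptions, only the row-tile (not μ-panel) variant is shown, and the depend iterators require an OpenMP 5.0 compiler.

```cpp
#include <cstdio>
#include <vector>

constexpr int N  = 1024;       // matrix dimension (assumption)
constexpr int B  = 128;        // panel width = tile height (assumption)
constexpr int NP = N / B;      // number of column panels
constexpr int NT = N / B;      // number of row tiles per panel

// Placeholder kernels: the Gauss-Jordan arithmetic itself is omitted.
// gje_panel_factor applies the partial-pivoted Gauss-Jordan step to panel k;
// gje_tile_update propagates that step to tile i of panel j.
void gje_panel_factor(std::vector<double>&, int /*k*/) {}
void gje_tile_update(std::vector<double>&, int /*k*/, int /*j*/, int /*i*/) {}

int main() {
    std::vector<double> A(static_cast<size_t>(N) * N, 1.0);
    static char pdep[NP];       // sentinel: "panel k has been factorized"
    static char tdep[NP][NT];   // sentinel: "tile (j,i) is up to date"

    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NP; ++k) {
        // Level 1: factorize panel k once all of its tiles are current.
        // The panel spans the full column height, so the partial-pivoting
        // search is confined to this single task.
        #pragma omp task depend(iterator(i = 0:NT), inout: tdep[k][i]) \
                         depend(out: pdep[k])
        gje_panel_factor(A, k);

        // Level 2: the update of every other panel is decomposed into row
        // tiles; tiles of different rows and panels can run concurrently.
        for (int j = 0; j < NP; ++j) {
            if (j == k) continue;
            for (int i = 0; i < NT; ++i) {
                #pragma omp task depend(in: pdep[k]) \
                                 depend(inout: tdep[j][i])
                gje_tile_update(A, k, j, i);
            }
        }
    }
    std::printf("task graph executed for a %dx%d matrix\n", N, N);
}
```

Because the dependences are expressed per tile rather than per panel, updates belonging to step k can overlap with the factorization of panel k+1 as soon as its own tiles are current, which is the kind of pipelining that fine-grained tasking enables.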
Convolutional Neural Network (CNN) based Deep Learning (DL) has achieved great progress in many real-life applications. Meanwhile, the complexity of modern model structures, combined with strict latency and memory restrictions, makes deploying CNN models on resource-limited platforms increasingly challenging. This work proposes a solution, called CompactNet, which automatically optimizes a pre-trained CNN model for a specific resource-limited platform given a specific target of inference speedup. Guided by a simulator of the target platform, CompactNet progressively trims the pre-trained network by removing redundant filters until the target speedup is reached, generating an optimal platform-specific model while maintaining accuracy. We evaluate our work on two platforms, a mobile ARM CPU and a machine-learning accelerator NPU (Cambricon-1A ISA), on a Huawei Mate10 smartphone. For MobileNetV2, the state-of-the-art slim CNN model designed for embedded platforms, CompactNet achieves up to a 1.8x kernel computation speedup with equal or even higher accuracy on image classification tasks over the ImageNet dataset, outperforming other successful CNN optimization techniques. Compared with state-of-the-art Neural Architecture Search (NAS) work, the optimal model generated by CompactNet is faster and can be applied to larger datasets such as ImageNet. Furthermore, the optimization process itself is much faster than such search-based approaches.
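The abstract leaves the filter-selection criterion unspecified; the sketch below assumes a common L1-norm redundancy proxy, which may well differ from CompactNet's actual rule. The simulator stub, the function names (simulate_latency, compact), and the toy model are all illustrative, and a real pipeline would interleave fine-tuning with each trimming step, as noted in the comments.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// A filter is represented by its weight vector; a layer is a set of filters.
using Filter = std::vector<float>;
using Layer  = std::vector<Filter>;
using Model  = std::vector<Layer>;

// Stand-in for the platform simulator: estimates inference cost as the total
// number of weights (a real simulator would model the target CPU/NPU).
double simulate_latency(const Model& m) {
    double cost = 0.0;
    for (const auto& layer : m)
        for (const auto& f : layer) cost += static_cast<double>(f.size());
    return cost;
}

// L1 norm of a filter, used here as a redundancy proxy (an assumption).
double l1_norm(const Filter& f) {
    return std::accumulate(f.begin(), f.end(), 0.0,
                           [](double s, float w) { return s + std::fabs(w); });
}

// Progressively trim the lowest-norm filter until the speedup target is met.
void compact(Model& m, double target_speedup) {
    const double baseline = simulate_latency(m);
    while (simulate_latency(m) * target_speedup > baseline) {
        Layer* best_layer = nullptr;
        size_t best_idx = 0;
        double best_norm = 1e300;
        for (auto& layer : m) {
            if (layer.size() <= 1) continue;      // keep every layer non-empty
            for (size_t i = 0; i < layer.size(); ++i) {
                double n = l1_norm(layer[i]);
                if (n < best_norm) { best_norm = n; best_layer = &layer; best_idx = i; }
            }
        }
        if (!best_layer) break;                   // nothing left to trim
        best_layer->erase(best_layer->begin() + best_idx);
        // In the real pipeline a short fine-tuning pass would follow each
        // removal to recover accuracy before the next trimming step.
    }
}

int main() {
    Model m(3, Layer(8, Filter(9, 0.5f)));        // toy 3-layer model
    compact(m, 1.8);                              // aim for a 1.8x speedup
    std::printf("remaining cost: %.0f\n", simulate_latency(m));
}
```

The design point worth noting is that the stopping condition is driven entirely by the simulator's latency estimate for the target platform, so the same trimming loop yields different final architectures on the CPU and the NPU.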
Alternative programming models and runtimes are increasing in popularity and maturity. This makes it possible to port emerging parallel approaches and compare them, on competitive grounds, against the traditional MPI+X paradigm. In this work, a distributed task-based implementation of a stencil computation is compared with a traditional MPI+X implementation of the same application. The Legion task-based parallel programming system is used as an alternative to MPI, while the underlying OpenMP approach is kept at the subdomain level. Overall, the results are promising toward making this alternative method competitive with the traditional MPI approach. In future work, extensions to other applications will be explored, as well as the use of GPUs.
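Since the abstract states that the OpenMP approach is kept at the subdomain level in both versions, a generic subdomain kernel is the common building block of the comparison. The sketch below assumes a 2D 5-point Jacobi stencil for illustration; the paper's actual stencil, subdomain sizes, and the Legion or MPI layers that exchange halos around this kernel are not shown.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Subdomain-level "X" kernel: one 5-point Jacobi sweep over a local
// (nx x ny) subdomain, parallelized with OpenMP. Whether MPI ranks or
// Legion tasks own the subdomains, this inner loop is the same; halo
// exchange (MPI) or region coherence (Legion) happens outside it.
void jacobi_sweep(const std::vector<double>& in, std::vector<double>& out,
                  int nx, int ny) {
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < nx - 1; ++i)
        for (int j = 1; j < ny - 1; ++j)
            out[i * ny + j] = 0.25 * (in[(i - 1) * ny + j] + in[(i + 1) * ny + j]
                                    + in[i * ny + j - 1] + in[i * ny + j + 1]);
}

int main() {
    const int nx = 256, ny = 256;   // subdomain size (assumption)
    std::vector<double> a(nx * ny, 1.0), b(nx * ny, 0.0);
    for (int step = 0; step < 100; ++step) {
        jacobi_sweep(a, b, nx, ny);
        std::swap(a, b);            // double buffering between sweeps
    }
    std::printf("a[center] = %f\n", a[(nx / 2) * ny + ny / 2]);
}
```

Keeping this kernel identical in both ports confines the comparison to the distribution layer itself, i.e., Legion's task scheduling and data movement versus explicit MPI message passing.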