In this paper we present PIRA, an infrastructure for automatic instrumentation refinement for performance analysis. It automates the generation of an initial performance overview measurement and gradually refines it based on the recorded runtime information. This can help a performance analyst with the time-consuming and largely manual, yet mechanical, task of selecting which functions to capture in subsequent measurements. PIRA implements an existing aggregation strategy that heuristically determines which functions to include in the initial overview measurement. Moreover, it implements a newly developed heuristic that incorporates profile information and expands instrumentation in hot-spot regions only. The approach is evaluated on different benchmarks, including the SU2 multi-physics solver package. PIRA is able to generate instrumentation configurations that contain the application’s hot-spot but incur significantly less overhead than the Score-P reference measurement.
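The refinement loop described above can be illustrated with a minimal sketch. It is not PIRA's actual heuristic: the function name `refine_instrumentation`, the `hot_fraction` threshold, and the toy call graph are all assumptions made for illustration. The idea shown is only that callees are added to the instrumentation set when their caller accounts for a large share of the measured runtime.

```python
from typing import Dict, List, Set


def refine_instrumentation(
    call_graph: Dict[str, List[str]],
    instrumented: Set[str],
    inclusive_runtime: Dict[str, float],
    hot_fraction: float = 0.2,  # hypothetical threshold, not taken from the paper
) -> Set[str]:
    """Return a new instrumentation set expanded below hot functions only."""
    # Use the largest inclusive runtime (typically the entry point) as the total.
    total = max(inclusive_runtime.values(), default=0.0)
    if total <= 0.0:
        return set(instrumented)

    refined = set(instrumented)
    for func in instrumented:
        share = inclusive_runtime.get(func, 0.0) / total
        if share >= hot_fraction:
            # Hot function: also instrument its direct callees next iteration.
            refined.update(call_graph.get(func, []))
    return refined


# Toy example (all names and timings are made up).
call_graph = {
    "main": ["solve", "io"],
    "solve": ["flux", "update"],
    "io": ["write"],
}
instrumented = {"main", "solve", "io"}
profile = {"main": 10.0, "solve": 9.0, "io": 0.5}

refined = refine_instrumentation(call_graph, instrumented, profile)
```

In this toy run, `solve` is hot and its callees `flux` and `update` are added, while `io` stays below the threshold and `write` remains uninstrumented. Repeating the step with a fresh profile would expand the next level below whichever of the new functions turns out to be hot.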
Field Programmable Gate Arrays (FPGAs) are widely available configurable hardware devices that are commonly used in many domain-specific applications. However, the complexity of their programming interface currently restricts their usage to highly qualified programmers dedicated to FPGAs. In order to democratize FPGAs, many efforts concentrate on High-Level Synthesis (HLS): the process of compiling a high-level language to hardware. In that context we propose PyGA, a proof of concept of a Python-to-FPGA compiler based on the Numba Just-In-Time (JIT) compiler for Python and the Intel FPGA SDK for OpenCL. It allows any Python user to seamlessly use an FPGA card as an accelerator for Python. As expected, early performance results are encouraging but not competitive with a compiled CPU version. The study shows that, to avoid overhead that cannot be compensated otherwise, tightly coupled accelerator designs such as Intel Xeon+FPGA are necessary to address larger code bases with finer-grained kernels. It also shows that, without FPGA-specific programming effort, further work on HLS compilation and the runtime is needed to be competitive with modern multi-core CPUs.
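The overhead argument above can be made concrete with a simple back-of-the-envelope model. This is an assumed model, not one taken from the study: the function `offload_speedup` and all timings below are hypothetical, and the fixed per-call overhead stands in for data transfer and kernel launch costs.

```python
def offload_speedup(cpu_time_s: float, fpga_kernel_time_s: float,
                    overhead_s: float) -> float:
    """Speedup of offloading one kernel call versus running it on the CPU.

    Assumed model: the offloaded call costs the FPGA kernel time plus a
    fixed per-call overhead (transfers, launch). Values above 1 mean the
    offload pays off.
    """
    return cpu_time_s / (fpga_kernel_time_s + overhead_s)


# Hypothetical fine-grained kernel: 1 ms on the CPU, 0.25 ms on the FPGA.
cpu_time = 1.0e-3
fpga_time = 0.25e-3

# Loosely coupled card: per-call overhead dominates (made-up 2 ms).
loose = offload_speedup(cpu_time, fpga_time, overhead_s=2.0e-3)

# Tightly coupled design: much smaller per-call overhead (made-up 0.05 ms).
tight = offload_speedup(cpu_time, fpga_time, overhead_s=0.05e-3)
```

Under these made-up numbers, the loosely coupled case is slower than the CPU even though the kernel itself runs four times faster, whereas the tightly coupled case retains most of the kernel speedup. The finer the kernel granularity, the more the fixed overhead term dominates.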