Truly heterogeneous systems enable partitioned workloads to be mapped to the hardware that yields the best performance. However, current practice requires that inter-device communication between different vendors' hardware use host memory as an intermediary. To date, no widely adopted solution allows accelerators to transfer data directly. A new cache-coherent protocol, Compute Express Link (CXL), aims to facilitate easier, fine-grained sharing between accelerators. In this work, we analyze existing methods for designing heterogeneous applications that target GPUs and FPGAs working collaboratively, and then explore the benefits of a CXL-enabled system. Specifically, we develop a test application that utilizes both an NVIDIA P100 GPU and a Xilinx U250 FPGA to expose current communication limitations. From this application, we capture overall execution time and throughput measurements on the FPGA and GPU. We use these measurements as inputs to novel CXL performance models to show that using CXL caching instead of host memory yields a 1.31X speedup, while a more tightly coupled, pipelined implementation on CXL-enabled hardware would yield a 1.45X speedup.
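The intuition behind the pipelined speedup can be sketched with a first-order timing model: when fine-grained sharing lets the GPU stage, the transfer stage, and the FPGA stage overlap, steady-state throughput is limited by the slowest stage rather than by the sum of all three. The sketch below is illustrative only; the stage times, batch count, and the model itself are assumptions for exposition, not the paper's measurements or its actual performance model.

```python
# Illustrative first-order model comparing serialized execution (each batch
# runs GPU -> host-memory transfer -> FPGA back to back) against a pipelined
# execution in which the three stages overlap. All timings are made-up
# placeholders in arbitrary units, not values from the paper.

def serial_time(t_gpu, t_fpga, t_xfer, n_batches):
    # Without overlap, every batch pays the full end-to-end latency.
    return n_batches * (t_gpu + t_xfer + t_fpga)

def pipelined_time(t_gpu, t_fpga, t_xfer, n_batches):
    # With overlap, steady-state cost per batch is the bottleneck stage;
    # the remaining stages contribute only pipeline fill/drain time.
    bottleneck = max(t_gpu, t_xfer, t_fpga)
    fill = (t_gpu + t_xfer + t_fpga) - bottleneck
    return fill + n_batches * bottleneck

if __name__ == "__main__":
    t_gpu, t_fpga, t_xfer, n = 4.0, 5.0, 2.0, 100  # hypothetical values
    s = serial_time(t_gpu, t_fpga, t_xfer, n)
    p = pipelined_time(t_gpu, t_fpga, t_xfer, n)
    print(f"speedup from pipelining: {s / p:.2f}x")
```

With these placeholder numbers the model favors pipelining more strongly than the paper's reported 1.45X, underscoring that real speedups depend on the measured stage balance and on transfer costs that CXL caching reduces but does not eliminate.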
This paper proposes a hierarchical organization of a distributed, parallel, and heterogeneous computer system based on the Sequential Codelet Model. It defines a hybrid dataflow/von Neumann architecture that enables parallel execution of sequentially defined Codelets across a multi-level hierarchical organization of the system. This architecture takes advantage of instruction-level parallelism techniques based on dataflow analysis, allowing implicit parallel execution of code. We present the resulting SuperCodelet Architecture, describe its program execution model and corresponding programming model, and evaluate the execution model through a prototype emulator built on commodity computer systems.
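The core idea of extracting implicit parallelism from sequentially defined units via dataflow analysis can be illustrated with a toy scheduler. The sketch below is an assumption-laden simplification: the `(name, reads, writes)` codelet format, the register naming, and the wave-based schedule are hypothetical constructs for exposition, not the SuperCodelet Architecture's actual mechanism.

```python
# Toy dataflow analysis over a sequential list of "codelets". Each codelet
# declares the registers it reads and writes; the analysis groups codelets
# into waves such that codelets in the same wave have no true, anti, or
# output dependence on one another and could execute in parallel.
# This format and analysis are illustrative assumptions only.

def schedule_waves(codelets):
    """codelets: list of (name, reads, writes) in sequential program order.
    Returns a list of waves (lists of names) respecting data dependences."""
    waves = []
    last_write = {}   # register -> wave index of its most recent writer
    last_access = {}  # register -> wave index of its most recent reader/writer
    for name, reads, writes in codelets:
        w = 0
        for r in reads:
            if r in last_write:               # true (read-after-write) dep.
                w = max(w, last_write[r] + 1)
        for wr in writes:
            if wr in last_access:             # anti/output dependence
                w = max(w, last_access[wr] + 1)
        while len(waves) <= w:
            waves.append([])
        waves[w].append(name)
        for r in reads:
            last_access[r] = max(last_access.get(r, -1), w)
        for wr in writes:
            last_write[wr] = max(last_write.get(wr, -1), w)
            last_access[wr] = max(last_access.get(wr, -1), w)
    return waves

if __name__ == "__main__":
    program = [
        ("A", [], ["x"]),          # independent producers...
        ("B", [], ["y"]),
        ("C", ["x", "y"], ["z"]),  # ...then consumers of their results
        ("D", ["x"], ["w"]),
    ]
    print(schedule_waves(program))
```

Here A and B are discovered to be independent and form the first wave, while C and D depend only on that wave; a hierarchical system could dispatch each wave's codelets to different execution units, which is the flavor of implicit parallelism the abstract describes.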