TY - GEN
T1 - Analysis and modeling of collaborative execution strategies for heterogeneous CPU-FPGA architectures
AU - Huang, Sitao
AU - Garcia de Gonzalo, Simon
AU - El-Hadedy, Mohamed
AU - Chang, Li-Wen
AU - Gómez-Luna, Juan
AU - Milojicic, Dejan
AU - El Hajj, Izzat
AU - Chalamalasetti, Sai Rahul
AU - Mutlu, Onur
AU - Chen, Deming
AU - Hwu, Wen-Mei
N1 - Funding Information:
This work was supported by Hewlett Packard Labs and the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA. We also thank Intel, VMware, Huawei, Alibaba, and Google for their gift funding support.
Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/4/4
Y1 - 2019/4/4
N2 - Heterogeneous CPU-FPGA systems are evolving towards tighter integration between CPUs and FPGAs for improved performance and energy efficiency. At the same time, programmability is also improving with High Level Synthesis tools (e.g., OpenCL Software Development Kits), which allow programmers to express their designs with high-level programming languages, and avoid time-consuming and error-prone register-transfer level (RTL) programming. In the traditional loosely-coupled accelerator mode, FPGAs work as offload accelerators, where an entire kernel runs on the FPGA while the CPU thread waits for the result. However, tighter integration of the CPUs and the FPGAs enables the possibility of fine-grained collaborative execution, i.e., having both devices working concurrently on the same workload. Such collaborative execution makes better use of the overall system resources by employing both CPU threads and FPGA concurrency, thereby achieving higher performance. In this paper, we explore the potential of collaborative execution between CPUs and FPGAs using OpenCL High Level Synthesis. First, we compare various collaborative techniques (namely, data partitioning and task partitioning), and evaluate the tradeoffs between them. We observe that choosing the most suitable partitioning strategy can improve performance by up to 2×. Second, we study the impact of a common optimization technique, kernel duplication, in a collaborative CPU-FPGA context. We show that the general trend is that kernel duplication improves performance until the memory bandwidth saturates. Third, we provide new insights that application developers can use when designing CPU-FPGA collaborative applications to choose between different partitioning strategies. We find that different partitioning strategies pose different tradeoffs (e.g., task partitioning enables more kernel duplication, while data partitioning has lower communication overhead and better load balance), but they generally outperform execution on conventional CPU-FPGA systems where no collaborative execution strategies are used. Therefore, we advocate even more integration in future heterogeneous CPU-FPGA systems (e.g., OpenCL 2.0 features, such as fine-grained shared virtual memory).
KW - CPU-FPGA architectures
KW - Heterogeneous systems
KW - OpenCL
KW - Performance analysis
UR - http://www.scopus.com/inward/record.url?scp=85064804067&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064804067&partnerID=8YFLogxK
U2 - 10.1145/3297663.3310305
DO - 10.1145/3297663.3310305
M3 - Conference contribution
AN - SCOPUS:85064804067
T3 - ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
SP - 79
EP - 90
BT - ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
PB - Association for Computing Machinery
T2 - 10th ACM/SPEC International Conference on Performance Engineering, ICPE 2019
Y2 - 7 April 2019 through 11 April 2019
ER -