Analysis and modeling of collaborative execution strategies for heterogeneous CPU-FPGA architectures

Sitao Huang, Simon Garcia De Gonzalo, Mohamed El-Hadedy, Li Wen Chang, Juan Gómez-Luna, Dejan Milojicic, Izzat El Hajj, Sai Rahul Chalamalasetti, Onur Mutlu, Deming Chen, Wen-Mei W Hwu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Heterogeneous CPU-FPGA systems are evolving towards tighter integration between CPUs and FPGAs for improved performance and energy efficiency. At the same time, programmability is also improving with High Level Synthesis tools (e.g., OpenCL Software Development Kits), which allow programmers to express their designs with high-level programming languages, and avoid time-consuming and error-prone register-transfer level (RTL) programming. In the traditional loosely-coupled accelerator mode, FPGAs work as offload accelerators, where an entire kernel runs on the FPGA while the CPU thread waits for the result. However, tighter integration of the CPUs and the FPGAs enables the possibility of fine-grained collaborative execution, i.e., having both devices working concurrently on the same workload. Such collaborative execution makes better use of the overall system resources by employing both CPU threads and FPGA concurrency, thereby achieving higher performance. In this paper, we explore the potential of collaborative execution between CPUs and FPGAs using OpenCL High Level Synthesis. First, we compare various collaborative techniques (namely, data partitioning and task partitioning), and evaluate the tradeoffs between them. We observe that choosing the most suitable partitioning strategy can improve performance by up to 2×. Second, we study the impact of a common optimization technique, kernel duplication, in a collaborative CPU-FPGA context. We show that the general trend is that kernel duplication improves performance until the memory bandwidth saturates. Third, we provide new insights that application developers can use when designing CPU-FPGA collaborative applications to choose between different partitioning strategies. 
We find that different partitioning strategies pose different tradeoffs (e.g., task partitioning enables more kernel duplication, while data partitioning has lower communication overhead and better load balance), but they generally outperform execution on conventional CPU-FPGA systems where no collaborative execution strategies are used. Therefore, we advocate even more integration in future heterogeneous CPU-FPGA systems (e.g., OpenCL 2.0 features, such as fine-grained shared virtual memory).
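
The two trends named in the abstract (load balance under data partitioning, and kernel duplication scaling until memory bandwidth saturates) can be illustrated with a minimal analytic sketch. This is not the model from the paper: the function names, parameters, and the linear-rate and bandwidth-cap assumptions below are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; NOT the paper's model. All names and the
# simple linear/roofline-style assumptions are hypothetical.

def data_partition_time(work_items, cpu_rate, fpga_rate, cpu_fraction):
    """Time when CPU and FPGA run the same kernel on disjoint data slices.

    Rates are in items per second. The devices run concurrently, so the
    slower device determines the total time.
    """
    cpu_time = cpu_fraction * work_items / cpu_rate
    fpga_time = (1.0 - cpu_fraction) * work_items / fpga_rate
    return max(cpu_time, fpga_time)


def balanced_cpu_fraction(cpu_rate, fpga_rate):
    """Fraction of the data given to the CPU so both devices finish together."""
    return cpu_rate / (cpu_rate + fpga_rate)


def duplicated_kernel_rate(copies, rate_per_copy, bytes_per_item, mem_bandwidth):
    """Throughput (items/s) of `copies` kernel instances on the FPGA.

    Assumes throughput grows linearly with the number of copies until the
    memory bandwidth (bytes/s) caps it.
    """
    compute_bound = copies * rate_per_copy
    bandwidth_bound = mem_bandwidth / bytes_per_item
    return min(compute_bound, bandwidth_bound)


if __name__ == "__main__":
    split = balanced_cpu_fraction(cpu_rate=100.0, fpga_rate=300.0)
    print("balanced CPU share:", split)
    print("time at balanced split:", data_partition_time(1000, 100.0, 300.0, split))
    for n in (1, 2, 4, 8):
        print(n, "copies ->", duplicated_kernel_rate(n, 1e6, 8, 3.2e7), "items/s")
```

Under these assumptions, adding kernel copies beyond the point where the compute bound meets the bandwidth bound yields no further gain, which mirrors the general trend the abstract reports.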

Original language: English (US)
Title of host publication: ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 79-90
Number of pages: 12
ISBN (Electronic): 9781450362399
DOIs: https://doi.org/10.1145/3297663.3310305
State: Published - Apr 4 2019
Externally published: Yes
Event: 10th ACM/SPEC International Conference on Performance Engineering, ICPE 2019 - Mumbai, India
Duration: Apr 7 2019 to Apr 11 2019

Publication series

Name: ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Conference

Conference: 10th ACM/SPEC International Conference on Performance Engineering, ICPE 2019
Country: India
City: Mumbai
Period: 4/7/19 to 4/11/19

Fingerprint

Program processors
Field programmable gate arrays (FPGA)
Particle accelerators
Data storage equipment
Computer programming languages
Energy efficiency
Software engineering
Computer systems
Bandwidth
Communication

Keywords

  • CPU-FPGA architectures
  • Heterogeneous systems
  • OpenCL
  • Performance analysis

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
  • Computer Science Applications

Cite this

Huang, S., De Gonzalo, S. G., El-Hadedy, M., Chang, L. W., Gómez-Luna, J., Milojicic, D., ... Hwu, W-M. W. (2019). Analysis and modeling of collaborative execution strategies for heterogeneous CPU-FPGA architectures. In ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (pp. 79-90). (ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering). Association for Computing Machinery, Inc. https://doi.org/10.1145/3297663.3310305

@inproceedings{e3396f1e6d57459f9716cb18c57db3da,
title = "Analysis and modeling of collaborative execution strategies for heterogeneous CPU-FPGA architectures",
keywords = "CPU-FPGA architectures, Heterogeneous systems, OpenCL, Performance analysis",
author = "Sitao Huang and {De Gonzalo}, {Simon Garcia} and Mohamed El-Hadedy and Chang, {Li Wen} and Juan G{\'o}mez-Luna and Dejan Milojicic and {El Hajj}, Izzat and Chalamalasetti, {Sai Rahul} and Onur Mutlu and Deming Chen and Hwu, {Wen-Mei W}",
year = "2019",
month = "4",
day = "4",
doi = "10.1145/3297663.3310305",
language = "English (US)",
series = "ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering",
publisher = "Association for Computing Machinery, Inc",
pages = "79--90",
booktitle = "ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering"
}

TY - GEN
T1 - Analysis and modeling of collaborative execution strategies for heterogeneous CPU-FPGA architectures
AU - Huang, Sitao
AU - De Gonzalo, Simon Garcia
AU - El-Hadedy, Mohamed
AU - Chang, Li Wen
AU - Gómez-Luna, Juan
AU - Milojicic, Dejan
AU - El Hajj, Izzat
AU - Chalamalasetti, Sai Rahul
AU - Mutlu, Onur
AU - Chen, Deming
AU - Hwu, Wen-Mei W
PY - 2019/4/4
Y1 - 2019/4/4
KW - CPU-FPGA architectures
KW - Heterogeneous systems
KW - OpenCL
KW - Performance analysis
UR - http://www.scopus.com/inward/record.url?scp=85064804067&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064804067&partnerID=8YFLogxK
U2 - 10.1145/3297663.3310305
DO - 10.1145/3297663.3310305
M3 - Conference contribution
T3 - ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
SP - 79
EP - 90
BT - ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
PB - Association for Computing Machinery, Inc
ER -