TY - GEN
T1 - FCUDA-SoC
T2 - 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2016
AU - Nguyen, Tan
AU - Gurumani, Swathi
AU - Rupnow, Kyle
AU - Chen, Deming
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/2/21
Y1 - 2016/2/21
N2 - Throughput oriented high level synthesis allows efficient design and optimization using parallel input languages. Parallel languages offer the benefit of parallelism extraction at multiple levels of granularity, offering effective design space exploration to select efficient single core implementations, and easy scaling of parallelism through multiple core instantiations. However, study of high level synthesis for parallel languages has concentrated on optimization of core and on-chip communications, while neglecting platform integration, which can have a significant impact on achieved performance. In this paper, we create an automated flow to perform efficient platform integration for an existing CUDA-to-RTL throughput oriented HLS, and we open source the FCUDA tool, platform integration, and benchmark applications. We demonstrate platform integration of 16 benchmarks on two Zynq-based systems in bare-metal and OS mode. We study implementation optimization for platform integration, compare to an embedded GPU (Tegra TK1) and verify designs on a Zedboard Zynq 7020 (bare-metal) and Omnitek Zynq 7045 (OS).
AB - Throughput oriented high level synthesis allows efficient design and optimization using parallel input languages. Parallel languages offer the benefit of parallelism extraction at multiple levels of granularity, offering effective design space exploration to select efficient single core implementations, and easy scaling of parallelism through multiple core instantiations. However, study of high level synthesis for parallel languages has concentrated on optimization of core and on-chip communications, while neglecting platform integration, which can have a significant impact on achieved performance. In this paper, we create an automated flow to perform efficient platform integration for an existing CUDA-to-RTL throughput oriented HLS, and we open source the FCUDA tool, platform integration, and benchmark applications. We demonstrate platform integration of 16 benchmarks on two Zynq-based systems in bare-metal and OS mode. We study implementation optimization for platform integration, compare to an embedded GPU (Tegra TK1) and verify designs on a Zedboard Zynq 7020 (bare-metal) and Omnitek Zynq 7045 (OS).
UR - http://www.scopus.com/inward/record.url?scp=84966565331&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84966565331&partnerID=8YFLogxK
U2 - 10.1145/2847263.2847344
DO - 10.1145/2847263.2847344
M3 - Conference contribution
AN - SCOPUS:84966565331
T3 - FPGA 2016 - Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
SP - 5
EP - 14
BT - FPGA 2016 - Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
PB - Association for Computing Machinery
Y2 - 21 February 2016 through 23 February 2016
ER -