TY - GEN
T1 - High-level synthesis of multiple dependent CUDA kernels on FPGA
AU - Gurumani, Swathi T.
AU - Cholakkal, Hisham
AU - Liang, Yun
AU - Rupnow, Kyle
AU - Chen, Deming
PY - 2013
Y1 - 2013
AB - High-level synthesis (HLS) tools provide automatic generation of hardware at the register transfer level (RTL) from algorithm descriptions written in high-level languages, enabling faster creation of custom accelerators for FPGA architectures. Existing HLS tools support a wide variety of input languages and assist users in design space exploration through automation and feedback on designs' performance bottlenecks. This design space exploration applies techniques such as pipelining, partitioning, and resource sharing to improve performance and resource utilization. However, although automated exploration can find some inherent parallelism, data-parallel input source code is still superior for exposing a greater variety of parallelism. In prior work, we demonstrated automated design space exploration of GPU multi-threaded (CUDA) source code for efficient RTL generation. In this paper, we examine the challenges in extending this automated design space exploration to multiple dependent CUDA kernels, present a step-by-step procedure for efficiently performing multi-kernel synthesis, and demonstrate the potential of this approach through a case study of a stereo matching algorithm. The study shows that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation while consuming over 16X less energy than the GPU. Based on our manual procedure, we identify the key challenges in fully automating the synthesis of multi-kernel CUDA programs.
UR - http://www.scopus.com/inward/record.url?scp=84877764003&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84877764003&partnerID=8YFLogxK
DO - 10.1109/ASPDAC.2013.6509613
M3 - Conference contribution
AN - SCOPUS:84877764003
SN - 9781467330299
T3 - Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC
SP - 305
EP - 312
BT - 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013
T2 - 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013
Y2 - 22 January 2013 through 25 January 2013
ER -