TY - JOUR
T1 - FCUDA-NoC
T2 - A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
AU - Chen, Yao
AU - Gurumani, Swathi T.
AU - Liang, Yun
AU - Li, Guofeng
AU - Guo, Donghui
AU - Rupnow, Kyle
AU - Chen, Deming
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/6
Y1 - 2016/6
AB - High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads of computation present in the parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, the off-chip memory bandwidth may be insufficient to meet the demand. Hence, a scalable system architecture and a data-sharing mechanism become necessary for improving system performance. The network-on-chip (NoC) paradigm for intrachip communication has proved to be an efficient alternative to a hierarchical bus or crossbar interconnect, since it can reduce wire routing congestion and offers higher operating frequencies and better scalability for adding new nodes. In this paper, we present a customizable NoC architecture along with a directory-based data-sharing mechanism for an existing CUDA-to-FPGA (FCUDA) flow to enable scalability of our system and improve overall system performance. We build a fully automated FCUDA-NoC generator that takes in CUDA code and custom network parameters as inputs and produces synthesizable register transfer level (RTL) code for the entire NoC system. We implement the NoC system on a VC709 Xilinx evaluation board and evaluate our architecture with a set of benchmarks. The results demonstrate that our FCUDA-NoC design is scalable and efficient: we improve the system execution time by up to 63× and reduce external memory reads by up to 81% compared with a single hardware core implementation.
KW - CUDA
KW - Parallel languages
KW - high-level synthesis (HLS)
KW - network-on-chip (NoC)
UR - http://www.scopus.com/inward/record.url?scp=84949844733&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84949844733&partnerID=8YFLogxK
U2 - 10.1109/TVLSI.2015.2497259
DO - 10.1109/TVLSI.2015.2497259
M3 - Article
AN - SCOPUS:84949844733
SN - 1063-8210
VL - 24
SP - 2220
EP - 2233
JO - IEEE Transactions on Very Large Scale Integration (VLSI) Systems
JF - IEEE Transactions on Very Large Scale Integration (VLSI) Systems
IS - 6
M1 - 7349212
ER -