FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

Yao Chen, Swathi T. Gurumani, Yun Liang, Guofeng Li, Donghui Guo, Kyle Rupnow, Deming Chen

Research output: Contribution to journal › Article › peer-review


High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads of computation present in the parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, the off-chip memory bandwidth may be insufficient to meet the demand. Hence, a scalable system architecture and a data-sharing mechanism become necessary for improving system performance. The network-on-chip (NoC) paradigm for intrachip communication has proved to be an efficient alternative to a hierarchical bus or crossbar interconnect, since it can reduce wire routing congestion, operate at higher frequencies, and scale better as new nodes are added. In this paper, we present a customizable NoC architecture along with a directory-based data-sharing mechanism for an existing CUDA-to-FPGA (FCUDA) flow to enable scalability of our system and improve overall system performance. We build a fully automated FCUDA-NoC generator that takes in CUDA code and custom network parameters as inputs and produces synthesizable register transfer level (RTL) code for the entire NoC system. We implement the NoC system on a Xilinx VC709 evaluation board and evaluate our architecture with a set of benchmarks. The results demonstrate that our FCUDA-NoC design is scalable and efficient: compared with a single hardware core implementation, it improves system execution time by up to 63× and reduces external memory reads by up to 81%.
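
The intuition behind directory-based data sharing can be illustrated with a small toy model. The sketch below is not the FCUDA-NoC implementation: the `Directory` class, the `simulate` helper, and the access pattern (every core reads every shared block once) are assumptions made purely for illustration. It only shows why serving repeated requests from a core that already holds a block, rather than from off-chip memory, reduces external reads.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Toy directory (hypothetical): tracks which cores hold a copy of each block."""
    holders: dict = field(default_factory=dict)  # block_id -> set of core ids
    external_reads: int = 0   # fetches served by off-chip memory
    peer_transfers: int = 0   # fetches served by another core over the network

    def fetch(self, core_id: int, block_id: int) -> str:
        cores = self.holders.setdefault(block_id, set())
        if core_id in cores:
            # The requesting core already has the block.
            return "local"
        if not cores:
            # No core holds the block yet, so it must come from external memory.
            self.external_reads += 1
            source = "external"
        else:
            # Some other core holds the block, so it is served on chip.
            self.peer_transfers += 1
            source = "peer"
        cores.add(core_id)
        return source


def simulate(num_cores: int, shared_blocks: int) -> Directory:
    """Assumed access pattern: each core reads every shared block once."""
    directory = Directory()
    for core in range(num_cores):
        for block in range(shared_blocks):
            directory.fetch(core, block)
    return directory


if __name__ == "__main__":
    d = simulate(num_cores=4, shared_blocks=8)
    print(d.external_reads, d.peer_transfers)
```

In this toy model, four cores sharing eight blocks issue 8 external reads instead of the 32 they would issue without sharing. These numbers describe only the assumed access pattern and say nothing about the measured results reported in the paper.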

Original language: English (US)
Article number: 7349212
Pages (from-to): 2220-2233
Number of pages: 14
Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Issue number: 6
State: Published - Jun 2016

Keywords


  • CUDA
  • Parallel languages
  • high-level synthesis (HLS)
  • network-on-chip (NoC)

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Electrical and Electronic Engineering

