TY - GEN
T1 - Integrated CUDA-to-FPGA synthesis with network-on-chip
AU - Gurumani, Swathi T.
AU - Tolar, Jacob
AU - Chen, Yao
AU - Liang, Yun
AU - Rupnow, Kyle
AU - Chen, Deming
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014/7/21
Y1 - 2014/7/21
N2 - Data-parallel languages such as CUDA and OpenCL efficiently describe many parallel threads of computation, and HLS tools can effectively translate these descriptions into independent optimized cores. As the number of instantiated cores grows, average external memory access latency can become a significant factor in system performance. However, although each core produces its outputs independently, the cores often heavily share input data. Exploiting on-chip data sharing both reduces external bandwidth demand and improves average memory access latency, allowing the system to improve performance with the same number of cores. In this paper, we develop a network-on-chip, coupled with computation cores synthesized from CUDA for FPGAs, that enables on-chip data sharing. We demonstrate reductions in external bandwidth demand of up to 60% (average 56%) and in total application latency, in cycles, of up to 43% (average 27%).
AB - Data-parallel languages such as CUDA and OpenCL efficiently describe many parallel threads of computation, and HLS tools can effectively translate these descriptions into independent optimized cores. As the number of instantiated cores grows, average external memory access latency can become a significant factor in system performance. However, although each core produces its outputs independently, the cores often heavily share input data. Exploiting on-chip data sharing both reduces external bandwidth demand and improves average memory access latency, allowing the system to improve performance with the same number of cores. In this paper, we develop a network-on-chip, coupled with computation cores synthesized from CUDA for FPGAs, that enables on-chip data sharing. We demonstrate reductions in external bandwidth demand of up to 60% (average 56%) and in total application latency, in cycles, of up to 43% (average 27%).
UR - http://www.scopus.com/inward/record.url?scp=84912523874&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84912523874&partnerID=8YFLogxK
U2 - 10.1109/FCCM.2014.14
DO - 10.1109/FCCM.2014.14
M3 - Conference contribution
AN - SCOPUS:84912523874
T3 - Proceedings - 2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines, FCCM 2014
SP - 21
EP - 24
BT - Proceedings - 2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines, FCCM 2014
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 22nd IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2014
Y2 - 11 May 2014 through 13 May 2014
ER -