TY - GEN
T1 - AccDNN
T2 - 26th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2018
AU - Zhang, Xiaofan
AU - Wang, Junsong
AU - Zhu, Chao
AU - Lin, Yonghua
AU - Xiong, Jinjun
AU - Hwu, Wen-Mei
AU - Chen, Deming
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/9/7
Y1 - 2018/9/7
N2 - Using FPGAs to accelerate Deep Neural Networks (DNNs) requires RTL programming, hardware verification, and precise resource allocation, which is both time-consuming and challenging. To address this issue, we present AccDNN, an end-to-end automation tool that can automatically generate high-performance DNN designs on FPGAs. Highlights of this tool include high-quality RTL network layer IPs, a fine-grained layer-based pipeline architecture, and a column-based cache scheme for high throughput, low latency, and reduced on-chip memory utilization. AccDNN also includes an automatic design space exploration tool, called A-REALM, used to generate optimized parallelism schemes by considering external memory access bandwidth, data reuse behaviors, resource availability, and network complexity. We demonstrate AccDNN on four DNNs (AlexNet, ZF, VGG16, and YOLO) on two Xilinx FPGAs (ZC706 and KU115) for edge and cloud computing, respectively. AccDNN generates designs that deliver 263 GOPS and 36.4 GOPS/W on ZC706 without any batching and 2109 GOPS and 94.5 GOPS/W on KU115.
KW - Acceleration
KW - Automation tool
KW - Deep Neural Network
KW - FPGA
UR - http://www.scopus.com/inward/record.url?scp=85057752883&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057752883&partnerID=8YFLogxK
U2 - 10.1109/FCCM.2018.00044
DO - 10.1109/FCCM.2018.00044
M3 - Conference contribution
AN - SCOPUS:85057752883
T3 - Proceedings - 26th IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2018
SP - 210
BT - Proceedings - 26th IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 29 April 2018 through 1 May 2018
ER -