TY - GEN
T1 - DNNBuilder
T2 - 37th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2018
AU - Zhang, Xiaofan
AU - Wang, Junsong
AU - Zhu, Chao
AU - Lin, Yonghua
AU - Xiong, Jinjun
AU - Hwu, Wen Mei
AU - Chen, Deming
N1 - Funding Information:
This work was partly supported by the IBM-Illinois Center for Cognitive Computing System Research (C3SR), a research collaboration as part of the IBM AI Horizons Network.
Publisher Copyright:
© 2018 ACM.
PY - 2018/11/5
Y1 - 2018/11/5
N2 - Building a high-performance FPGA accelerator for Deep Neural Networks (DNNs) often requires RTL programming, hardware verification, and precise resource allocation, all of which can be time-consuming and challenging to perform even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder for building high-performance DNN hardware accelerators on FPGAs automatically. Novel techniques are developed to meet the throughput and latency requirements for both cloud and edge devices. A number of novel techniques including high-quality RTL neural network components, a fine-grained layer-based pipeline architecture, and a column-based cache scheme are developed to boost throughput, reduce latency, and save FPGA on-chip memory. To address the limited resource challenge, we design an automatic design space exploration tool to generate optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity. DNNBuilder is demonstrated on four DNNs (AlexNet, ZF, VGG16, and YOLO) on two FPGAs (XC7Z045 and KU115) corresponding to edge and cloud computing, respectively. The fine-grained layer-based pipeline architecture and the column-based cache scheme contribute to 7.7x and 43x reductions in latency and BRAM utilization, respectively, compared to conventional designs. We achieve the best performance (up to 5.15x faster) and efficiency (up to 5.88x more efficient) compared to published FPGA-based classification-oriented DNN accelerators for both edge and cloud computing cases. We reach 4218 GOPS for running an object detection DNN, which is the highest throughput reported to the best of our knowledge. DNNBuilder can provide millisecond-scale real-time performance for processing HD video input and deliver higher efficiency (up to 4.35x) than the GPU-based solutions.
AB - Building a high-performance FPGA accelerator for Deep Neural Networks (DNNs) often requires RTL programming, hardware verification, and precise resource allocation, all of which can be time-consuming and challenging to perform even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder for building high-performance DNN hardware accelerators on FPGAs automatically. Novel techniques are developed to meet the throughput and latency requirements for both cloud and edge devices. A number of novel techniques including high-quality RTL neural network components, a fine-grained layer-based pipeline architecture, and a column-based cache scheme are developed to boost throughput, reduce latency, and save FPGA on-chip memory. To address the limited resource challenge, we design an automatic design space exploration tool to generate optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity. DNNBuilder is demonstrated on four DNNs (AlexNet, ZF, VGG16, and YOLO) on two FPGAs (XC7Z045 and KU115) corresponding to edge and cloud computing, respectively. The fine-grained layer-based pipeline architecture and the column-based cache scheme contribute to 7.7x and 43x reductions in latency and BRAM utilization, respectively, compared to conventional designs. We achieve the best performance (up to 5.15x faster) and efficiency (up to 5.88x more efficient) compared to published FPGA-based classification-oriented DNN accelerators for both edge and cloud computing cases. We reach 4218 GOPS for running an object detection DNN, which is the highest throughput reported to the best of our knowledge. DNNBuilder can provide millisecond-scale real-time performance for processing HD video input and deliver higher efficiency (up to 4.35x) than the GPU-based solutions.
UR - http://www.scopus.com/inward/record.url?scp=85058185331&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058185331&partnerID=8YFLogxK
U2 - 10.1145/3240765.3240801
DO - 10.1145/3240765.3240801
M3 - Conference contribution
AN - SCOPUS:85058185331
T3 - IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD
BT - 2018 IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2018 - Digest of Technical Papers
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 5 November 2018 through 8 November 2018
ER -