DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs

Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-Mei W Hwu, Deming Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Building a high-performance FPGA accelerator for Deep Neural Networks (DNNs) often requires RTL programming, hardware verification, and precise resource allocation, all of which can be time-consuming and challenging even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder, a tool that automatically builds high-performance DNN hardware accelerators on FPGAs. To meet the throughput and latency requirements of both cloud and edge devices, we develop a number of novel techniques, including high-quality RTL neural network components, a fine-grained layer-based pipeline architecture, and a column-based cache scheme, which boost throughput, reduce latency, and save FPGA on-chip memory. To address the challenge of limited resources, we design an automatic design space exploration tool that generates optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity. DNNBuilder is demonstrated on four DNNs (Alexnet, ZF, VGG16, and YOLO) on two FPGAs (XC7Z045 and KU115), corresponding to edge and cloud computing, respectively. The fine-grained layer-based pipeline architecture and the column-based cache scheme contribute to 7.7x and 43x reductions in latency and BRAM utilization, respectively, compared to conventional designs. We achieve the best performance (up to 5.15x faster) and efficiency (up to 5.88x more efficient) compared to published FPGA-based classification-oriented DNN accelerators for both edge and cloud computing cases. We reach 4218 GOPS when running an object detection DNN, which is the highest throughput reported to the best of our knowledge. DNNBuilder can provide millisecond-scale real-time performance for processing HD video input and deliver higher efficiency (up to 4.35x) than GPU-based solutions.
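
The abstract mentions allocating limited FPGA resources across a layer-based pipeline. As a rough illustration only, the sketch below splits a fixed budget of compute units across pipeline stages in proportion to each layer's workload, then reports the slowest stage, which bounds pipeline throughput. This is not DNNBuilder's design space exploration algorithm: the function names, the proportional heuristic, and the layer workloads are all assumptions made for this example, and the sketch ignores memory bandwidth and data reuse, which the abstract says the real tool considers.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    macs: int  # multiply-accumulate operations per input (illustrative values only)


def allocate_parallelism(layers, total_units):
    """Hypothetical heuristic: give every stage one unit, then split the
    remaining units in proportion to each layer's MAC count."""
    if total_units < len(layers):
        raise ValueError("need at least one compute unit per layer")
    spare = total_units - len(layers)
    total_macs = sum(layer.macs for layer in layers)
    # Floor division keeps the total allocation within the budget.
    return {layer.name: 1 + (layer.macs * spare) // total_macs for layer in layers}


def stage_cycles(layers, alloc):
    """Cycles per stage, assuming one MAC per unit per cycle (ceiling division)."""
    return {layer.name: -(-layer.macs // alloc[layer.name]) for layer in layers}


def pipeline_interval(layers, alloc):
    """The slowest stage sets how often the pipeline can accept a new input."""
    return max(stage_cycles(layers, alloc).values())


if __name__ == "__main__":
    demo = [Layer("conv1", 100), Layer("conv2", 300), Layer("fc", 100)]
    alloc = allocate_parallelism(demo, total_units=13)
    print(alloc)
    print(pipeline_interval(demo, alloc))
```

Balancing stage latencies matters in a layer-based pipeline because any stage that is slower than the others leaves the remaining stages idle for part of each interval.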

Original language: English (US)
Title of host publication: 2018 IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2018 - Digest of Technical Papers
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781450359504
DOIs: https://doi.org/10.1145/3240765.3240801
State: Published - Nov 5 2018
Event: 37th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2018 - San Diego, United States
Duration: Nov 5 2018 → Nov 8 2018

Publication series

Name: IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD
ISSN (Print): 1092-3152

Other

Other: 37th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2018
Country: United States
City: San Diego
Period: 11/5/18 → 11/8/18

Fingerprint

Particle accelerators
Field programmable gate arrays (FPGA)
Hardware
Throughput
Cloud computing
Pipelines
Data storage equipment
Network components
Deep neural networks
Resource allocation
Availability
Neural networks
Bandwidth
Processing

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
  • Computer Graphics and Computer-Aided Design

Cite this

Zhang, X., Wang, J., Zhu, C., Lin, Y., Xiong, J., Hwu, W-M. W., & Chen, D. (2018). DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In 2018 IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2018 - Digest of Technical Papers [a56] (IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3240765.3240801

@inproceedings{b38af3a2480f46c0a94af20fafc62fd1,
title = "DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs",
author = "Xiaofan Zhang and Junsong Wang and Chao Zhu and Yonghua Lin and Jinjun Xiong and Hwu, {Wen-Mei W} and Deming Chen",
year = "2018",
month = "11",
day = "5",
doi = "10.1145/3240765.3240801",
language = "English (US)",
series = "IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "2018 IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2018 - Digest of Technical Papers",
address = "United States",

}

TY  - GEN
T1  - DNNBuilder
T2  - An automated tool for building high-performance DNN hardware accelerators for FPGAs
AU  - Zhang, Xiaofan
AU  - Wang, Junsong
AU  - Zhu, Chao
AU  - Lin, Yonghua
AU  - Xiong, Jinjun
AU  - Hwu, Wen-Mei W
AU  - Chen, Deming
PY  - 2018/11/5
Y1  - 2018/11/5
UR  - http://www.scopus.com/inward/record.url?scp=85058185331&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85058185331&partnerID=8YFLogxK
U2  - 10.1145/3240765.3240801
DO  - 10.1145/3240765.3240801
M3  - Conference contribution
AN  - SCOPUS:85058185331
T3  - IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD
BT  - 2018 IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2018 - Digest of Technical Papers
PB  - Institute of Electrical and Electronics Engineers Inc.
ER  -