TY - GEN
T1 - Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS
AU - Li, Qin
AU - Zhang, Xiaofan
AU - Xiong, Jinjun
AU - Hwu, Wen-Mei
AU - Chen, Deming
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/1/21
Y1 - 2019/1/21
AB - Neural machine translation (NMT) is a popular topic in natural language processing (NLP) that uses deep neural networks (DNNs) to translate from a source language to a target language. With emerging techniques such as bidirectional Gated Recurrent Units (GRUs), attention mechanisms, and beam-search algorithms, NMT can deliver better translation quality than conventional statistics-based methods, especially for long sentences. However, higher translation quality means more complicated models, higher computation and memory demands, and longer translation time, which hinders practical use. In this paper, we propose a design methodology for implementing the inference of a real-life NMT model (with a problem size of 172 GFLOP) on an FPGA for improved runtime latency and energy efficiency. We use High-Level Synthesis (HLS) to build high-performance parameterized IPs for the most basic operations (multiply-accumulations) and compose these IPs to accelerate the matrix-vector multiplication (MVM) kernels that are used frequently in NMT. We also perform a design space exploration that considers both computation resources and memory access bandwidth when exploiting the hardware parallelism in the model, and we generate the best parameter configurations for the proposed IPs. Accordingly, we propose a novel hybrid parallel structure for accelerating the NMT model with affordable resource overhead on the targeted FPGA. Our design is demonstrated on a Xilinx VCU118 with an overall performance of 7.16 GFLOPS.
UR - http://www.scopus.com/inward/record.url?scp=85061158804&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85061158804&partnerID=8YFLogxK
U2 - 10.1145/3287624.3287717
DO - 10.1145/3287624.3287717
M3 - Conference contribution
AN - SCOPUS:85061158804
T3 - Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC
SP - 693
EP - 698
BT - ASP-DAC 2019 - 24th Asia and South Pacific Design Automation Conference
PB - Association for Computing Machinery
T2 - 24th Asia and South Pacific Design Automation Conference, ASP-DAC 2019
Y2 - 21 January 2019 through 24 January 2019
ER -
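
Note: the abstract above describes HLS-built, parameterized multiply-accumulate IPs composed into matrix-vector multiplication (MVM) kernels. The C++ below is a minimal illustrative sketch of that general pattern under assumptions of mine, not code from the paper; the function name mvm_ip, the template parameter PE (number of parallel MAC lanes), and the pragma placement are all hypothetical.

// Illustrative sketch only (assumed names and parameters, not from the paper):
// a parameterized MVM IP built from PE parallel multiply-accumulate lanes,
// in the spirit of the HLS methodology the abstract describes.
// Assumes COLS is a multiple of PE.
template <typename T, int ROWS, int COLS, int PE>
void mvm_ip(const T W[ROWS][COLS], const T x[COLS], T y[ROWS]) {
    // Partition the weight columns and the input vector so the PE MAC lanes
    // can fetch PE operand pairs per cycle; this compute-vs-bandwidth
    // trade-off is the axis a design space exploration would tune.
#pragma HLS ARRAY_PARTITION variable=W cyclic factor=PE dim=2
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=PE dim=1
    for (int r = 0; r < ROWS; ++r) {
        T partial[PE];
#pragma HLS ARRAY_PARTITION variable=partial complete
        for (int p = 0; p < PE; ++p) {
#pragma HLS UNROLL
            partial[p] = T(0);
        }
        // PE multiply-accumulate lanes process strided columns in parallel;
        // per-lane accumulators spread the loop-carried dependence across
        // the PE lanes so the loop can be pipelined.
        for (int c = 0; c < COLS; c += PE) {
#pragma HLS PIPELINE II=1
            for (int p = 0; p < PE; ++p) {
#pragma HLS UNROLL
                partial[p] += W[r][c + p] * x[c + p];
            }
        }
        // Reduce the PE partial sums into one output element.
        T acc = T(0);
        for (int p = 0; p < PE; ++p) {
#pragma HLS UNROLL
            acc += partial[p];
        }
        y[r] = acc;
    }
}

A hybrid parallel structure in the spirit of the paper's would instantiate several such IPs with different (ROWS, COLS, PE) configurations for the encoder GRU, attention, and decoder MVM workloads; the concrete configurations here are assumptions, not the paper's reported design points.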