Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS

Qin Li, Xiaofan Zhang, Jin Jun Xiong, Wen-Mei W Hwu, Deming Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Neural machine translation (NMT) is a popular topic in natural language processing that uses deep neural networks (DNNs) to translate from a source language to a target language. With emerging techniques such as bidirectional Gated Recurrent Units (GRUs), attention mechanisms, and beam-search algorithms, NMT delivers improved translation quality compared to conventional statistics-based methods, especially for long sentences. However, higher translation quality comes with more complicated models, higher computation/memory demands, and longer translation time, which hinders practical use. In this paper, we propose a design methodology for implementing the inference of a real-life NMT model (problem size = 172 GFLOP) on an FPGA for improved runtime latency and energy efficiency. We use High-Level Synthesis (HLS) to build high-performance parameterized IPs that handle the most basic operations (multiply-accumulations) and compose these IPs to accelerate the matrix-vector multiplication (MVM) kernels used frequently throughout NMT. We also perform a design space exploration that considers both computation resources and memory access bandwidth when exploiting the hardware parallelism in the model, and we generate the best parameter configurations for the proposed IPs. Accordingly, we propose a novel hybrid parallel structure for accelerating the NMT with affordable resource overhead on the targeted FPGA. Our design is demonstrated on a Xilinx VCU118 with an overall performance of 7.16 GFLOPS.
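The abstract describes composing parameterized multiply-accumulate (MAC) IPs into matrix-vector multiplication kernels via HLS, with the parallelism configuration chosen by design space exploration. Below is a minimal sketch of what such a kernel can look like in HLS C++; the function name, dimensions, parallelism factor, and pragma choices are all illustrative assumptions, not the authors' actual IP design.

```cpp
// A minimal sketch of a parameterized MVM kernel in HLS C++, assuming
// Vivado/Vitis HLS pragma syntax. All names and sizes (mvm_kernel, PAR,
// N_ROWS, N_COLS) are assumptions for illustration only.

#define N_ROWS 512  // output dimension (assumed)
#define N_COLS 512  // input dimension (assumed)
#define PAR    8    // MAC parallelism factor; the knob a DSE would tune

void mvm_kernel(const float W[N_ROWS][N_COLS],
                const float x[N_COLS],
                float y[N_ROWS]) {
  // Cyclic partitioning exposes PAR operand pairs per cycle so that PAR
  // multiply-accumulate (MAC) units can run in parallel. Weights are kept
  // on-chip here purely for brevity. The partition factor must equal PAR.
#pragma HLS ARRAY_PARTITION variable=W cyclic factor=8 dim=2
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=8 dim=1

ROW:
  for (int i = 0; i < N_ROWS; ++i) {
    float partial[PAR];
#pragma HLS ARRAY_PARTITION variable=partial complete
    for (int p = 0; p < PAR; ++p) partial[p] = 0.0f;

COL:
    for (int j = 0; j < N_COLS; j += PAR) {
#pragma HLS PIPELINE II=1
      // PAR MACs issue per loop iteration; with floating-point adds the
      // accumulation latency may raise the achieved II above 1.
      for (int p = 0; p < PAR; ++p) {
#pragma HLS UNROLL
        partial[p] += W[i][j + p] * x[j + p];
      }
    }

    // Reduce the PAR partial sums into one output element.
    float acc = 0.0f;
    for (int p = 0; p < PAR; ++p) acc += partial[p];
    y[i] = acc;
  }
}
```

In a sketch like this, the design space exploration mentioned in the abstract would amount to sweeping PAR (and the number of kernel instances mapped to the GRU gates, attention, and output layers) against the device's DSP count and memory bandwidth, and the hybrid parallel structure would assign different configurations to different parts of the model. Treat both points as a reading of the abstract, not as details of the published design.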

Original language: English (US)
Title of host publication: ASP-DAC 2019 - 24th Asia and South Pacific Design Automation Conference
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 693-698
Number of pages: 6
ISBN (Electronic): 9781450360074
DOIs: https://doi.org/10.1145/3287624.3287717
State: Published - Jan 21 2019
Event: 24th Asia and South Pacific Design Automation Conference, ASPDAC 2019 - Tokyo, Japan
Duration: Jan 21 2019 → Jan 24 2019

Publication series

Name: Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC

Other

Other: 24th Asia and South Pacific Design Automation Conference, ASPDAC 2019
Country: Japan
City: Tokyo
Period: 1/21/19 → 1/24/19

Fingerprint

  • Field programmable gate arrays (FPGA)
  • Data storage equipment
  • Computer hardware
  • Energy efficiency
  • Statistics
  • Bandwidth
  • Processing
  • High-level synthesis
  • Deep neural networks

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Computer Science Applications
  • Computer Graphics and Computer-Aided Design

Cite this

Li, Q., Zhang, X., Xiong, J. J., Hwu, W-M. W., & Chen, D. (2019). Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS. In ASP-DAC 2019 - 24th Asia and South Pacific Design Automation Conference (pp. 693-698). (Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3287624.3287717
