Speeding Up Transformer Decoding via an Attention Refinement Network

Kaixin Wu, Yue Zhang, Bojie Hu, Tong Zhang

Research output: Contribution to journal › Conference article › peer-review

Abstract

Despite the revolutionary advances made by the Transformer in Neural Machine Translation (NMT), inference efficiency remains an obstacle due to the heavy use of attention operations in auto-regressive decoding. We therefore propose a lightweight attention structure, the Attention Refinement Network (ARN), for speeding up the Transformer. Specifically, we design a weighted residual network that reconstructs the attention by reusing features across layers. To further improve Transformer efficiency, we merge the self-attention and cross-attention components for parallel computing. Extensive experiments on ten WMT machine translation tasks show that the proposed model is on average 1.35× faster than the state-of-the-art inference implementation, with almost no decrease in BLEU.
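The abstract only sketches the architecture, so below is a minimal, hypothetical PyTorch sketch (not the authors' code) of the weighted-residual idea it describes: the attention output of the current decoder layer is refined with a learned weighted combination of attention features reused from earlier layers. The class name, the softmax-normalized per-layer weights, and the output projection are all illustrative assumptions.

    import torch
    import torch.nn as nn


    class AttentionRefinementSketch(nn.Module):
        """Illustrative sketch only: refine the current layer's attention
        output with a learned weighted residual over features reused from
        earlier decoder layers. The actual ARN parameterization may differ."""

        def __init__(self, d_model: int, num_layers: int):
            super().__init__()
            # One learnable mixing weight per preceding layer (assumed parameterization).
            self.layer_weights = nn.Parameter(torch.zeros(num_layers))
            self.proj = nn.Linear(d_model, d_model)

        def forward(self, current, cached):
            # current: (batch, tgt_len, d_model) attention output of this layer
            # cached:  list of attention outputs reused from earlier layers
            if cached:
                weights = torch.softmax(self.layer_weights[: len(cached)], dim=0)
                reused = sum(w * h for w, h in zip(weights, cached))
                # Weighted residual: reuse cached features instead of
                # recomputing full attention from scratch at this layer.
                current = current + self.proj(reused)
            return current

At decoding time, reusing cached features in this way would avoid part of the per-layer attention computation, which is where the reported speedup would come from; the merging of self-attention and cross-attention for parallel computing mentioned in the abstract is not shown in this sketch.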

Original language: English (US)
Pages (from-to): 5109-5118
Number of pages: 10
Journal: Proceedings - International Conference on Computational Linguistics, COLING
Volume: 29
Issue number: 1
State: Published - 2022
Externally published: Yes
Event: 29th International Conference on Computational Linguistics, COLING 2022 - Gyeongju, Korea, Republic of
Duration: Oct 12, 2022 to Oct 17, 2022

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Theoretical Computer Science
