Abstract
Despite the revolutionary advances made by the Transformer in Neural Machine Translation (NMT), inference efficiency remains an obstacle due to the heavy use of attention operations in auto-regressive decoding. We therefore propose a lightweight attention structure called the Attention Refinement Network (ARN) to speed up the Transformer. Specifically, we design a weighted residual network that reconstructs the attention by reusing features across layers. To further improve Transformer efficiency, we merge the self-attention and cross-attention components for parallel computation. Extensive experiments on ten WMT machine translation tasks show that the proposed model is on average 1.35× faster than the state-of-the-art inference implementation, with almost no decrease in BLEU.
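The record does not include implementation details, but the two ideas in the abstract can be illustrated with a minimal PyTorch-style sketch: a decoder layer that refines the previous layer's attention features through a learned residual weight instead of recomputing everything from scratch, and that computes self-attention and cross-attention from the same normalized input so the two branches can run in parallel. All class, parameter, and argument names below are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class RefinedAttentionDecoderLayer(nn.Module):
    """Sketch of one decoder layer with cross-layer attention reuse (assumed design)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learned scalar weight for mixing this layer's attention with the
        # previous layer's attention features (the "weighted residual").
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, memory, prev_self_out=None, prev_cross_out=None):
        h = self.norm(x)
        # Both attention branches depend only on `h` (and the encoder memory),
        # so they are independent and could be launched in parallel.
        self_out, _ = self.self_attn(h, h, h, need_weights=False)
        cross_out, _ = self.cross_attn(h, memory, memory, need_weights=False)
        # Weighted residual reuse of the previous layer's attention features.
        if prev_self_out is not None:
            self_out = self.alpha * self_out + (1 - self.alpha) * prev_self_out
        if prev_cross_out is not None:
            cross_out = self.alpha * cross_out + (1 - self.alpha) * prev_cross_out
        x = x + self_out + cross_out
        x = x + self.ffn(self.norm(x))
        return x, self_out, cross_out
```

In this reading, each layer passes its attention outputs to the next, so later layers can lean on earlier features rather than fully recomputing attention at every auto-regressive decoding step, which is where the claimed speedup would come from.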
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 5109-5118 |
| Number of pages | 10 |
| Journal | Proceedings - International Conference on Computational Linguistics, COLING |
| Volume | 29 |
| Issue number | 1 |
| State | Published - 2022 |
| Externally published | Yes |
| Event | 29th International Conference on Computational Linguistics, COLING 2022 - Gyeongju, Korea, Republic of. Duration: Oct 12 2022 → Oct 17 2022 |
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Science Applications
- Theoretical Computer Science