TY - CONF
T1 - Random Feature Attention
AU - Peng, Hao
AU - Pappas, Nikolaos
AU - Yogatama, Dani
AU - Schwartz, Roy
AU - Smith, Noah A.
AU - Kong, Lingpeng
N1 - Funding Information:
We would like to thank Phil Blunsom, Chris Dyer, Nando de Freitas, Jungo Kasai, Adhiguna Kuncoro, Dianqi Li, Ofir Press, Lianhui Qin, Swabha Swayamdipta, Sam Thomson, the language team at DeepMind and the ARK group at the University of Washington for their helpful feedback. We also thank Tay Yi for helping run the Long Range Arena experiments, Richard Tanburn for the advice on implementations, and the anonymous reviewers for their thoughtful comments. This work was supported in part by NSF grant 1562364 and a Google Fellowship. Nikolaos Pappas was supported by the Swiss National Science Foundation under grant number P400P2 183911 “UNISON.”
Publisher Copyright:
© 2021 ICLR 2021 - 9th International Conference on Learning Representations. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA's efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.
AB - Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA's efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.
UR - http://www.scopus.com/inward/record.url?scp=85127399183&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127399183&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85127399183
T2 - 9th International Conference on Learning Representations, ICLR 2021
Y2 - 3 May 2021 through 7 May 2021
ER -
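
A minimal NumPy sketch of the random-feature idea the abstract describes: trigonometric random features approximate the softmax (exponential) kernel, so attention can be computed from per-sequence sums in linear time and space. This is an illustrative assumption-laden sketch, not the authors' implementation; the function names (random_features, rfa) and the feature count are hypothetical choices.

import numpy as np

def random_features(x, w, scale):
    # Trigonometric random features: phi(a) . phi(b) estimates exp(-||a - b||^2 / 2)
    # for a = x / scale, with random projections w ~ N(0, I).
    proj = (x / scale) @ w.T                                   # (n, D)
    return np.concatenate([np.sin(proj), np.cos(proj)], -1) / np.sqrt(w.shape[0])

def rfa(q, k, v, num_features=64, seed=0):
    # Hypothetical sketch: linear-time approximation of softmax(q k^T / sqrt(d)) v.
    d = q.shape[-1]
    scale = d ** 0.25                                          # overall q.k scaling of 1/sqrt(d)
    w = np.random.default_rng(seed).standard_normal((num_features, d))
    # exp(a.b) = exp(||a||^2/2) exp(||b||^2/2) exp(-||a - b||^2/2); fold the norm terms
    # into the feature maps so exp(q.k / sqrt(d)) ~ phi(q) . phi(k).
    phi_q = random_features(q, w, scale) * np.exp((q ** 2).sum(-1, keepdims=True) / (2 * d ** 0.5))
    phi_k = random_features(k, w, scale) * np.exp((k ** 2).sum(-1, keepdims=True) / (2 * d ** 0.5))
    # One pass over keys/values; the sums are reused for every query (O(n) time, O(1) in n memory).
    s = phi_k.T @ v                                            # (2D, d_v)
    z = phi_k.sum(axis=0)                                      # (2D,)
    return (phi_q @ s) / (phi_q @ z)[:, None]                  # (n, d_v)

# Example usage (hypothetical shapes):
#   q, k, v = [np.random.randn(128, 64) for _ in range(3)]
#   out = rfa(q, k, v, num_features=256)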