TY - GEN
T1 - On the Transformer Growth for Progressive BERT Training
AU - Gu, Xiaotao
AU - Liu, Liyuan
AU - Yu, Hongkun
AU - Li, Jing
AU - Chen, Chen
AU - Han, Jiawei
N1 - Publisher Copyright:
© 2021 Association for Computational Linguistics.
PY - 2021
Y1 - 2021
N2 - Due to the excessive cost of large-scale language model pre-training, considerable efforts have been made to train BERT progressively: start from an inferior but low-cost model and gradually grow the model to increase the computational complexity. Our objective is to advance the understanding of Transformer growth and discover principles that guide progressive training. First, we find that, similar to network architecture search, Transformer growth also favors compound scaling. Specifically, while existing methods only conduct network growth in a single dimension, we observe that it is beneficial to use compound growth operators and balance multiple dimensions (e.g., depth, width, and input length of the model). Moreover, we explore alternative growth operators in each dimension via controlled comparison to give practical guidance for operator selection. In light of our analyses, the proposed method CompoundGrow speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively, while achieving comparable performance.
UR - http://www.scopus.com/inward/record.url?scp=85118016335&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85118016335&partnerID=8YFLogxK
U2 - 10.18653/v1/2021.naacl-main.406
DO - 10.18653/v1/2021.naacl-main.406
M3 - Conference contribution
AN - SCOPUS:85118016335
T3 - NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
SP - 5174
EP - 5180
BT - NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics
PB - Association for Computational Linguistics (ACL)
T2 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021
Y2 - 6 June 2021 through 11 June 2021
ER -