TY - GEN
T1 - XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient
T2 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
AU - Wu, Xiaoxia
AU - Yao, Zhewei
AU - Zhang, Minjia
AU - Li, Conglong
AU - He, Yuxiong
N1 - This work was done within the DeepSpeed team at Microsoft. We appreciate the help from the DeepSpeed team. In particular, we thank Jeff Rasley for coming up with the name of our method and Elton Zheng for resolving engineering issues. We also thank the Turing team at Microsoft for their engineering support.
PY - 2022
Y1 - 2022
AB - Extreme compression, particularly ultra-low-bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve accuracy under such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage, expensive knowledge distillation with extensive hyperparameter tuning. They also often give less attention to smaller transformer models that have already been heavily compressed via knowledge distillation, and they lack a systematic study demonstrating the effectiveness of their methods. In this paper, we perform a comprehensive, systematic study to measure the impact of many key hyperparameters and training strategies from previous works. We find that previous baselines for ultra-low-bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; and (2) extreme quantization plus layer reduction can reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
UR - http://www.scopus.com/inward/record.url?scp=85147334872&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147334872&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85147334872
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
A2 - Koyejo, S.
A2 - Mohamed, S.
A2 - Agarwal, A.
A2 - Belgrave, D.
A2 - Cho, K.
A2 - Oh, A.
PB - Neural Information Processing Systems Foundation
Y2 - 28 November 2022 through 9 December 2022
ER -
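
For readers unfamiliar with the binary/ternary quantization the abstract refers to, below is a minimal Python sketch of one common ternarization scheme (the Ternary Weight Networks heuristic). The threshold rule and per-tensor scaling here are illustrative assumptions, not the exact recipe used in XTC.

import numpy as np

def ternarize(w, delta_scale=0.7):
    # Threshold below which weights are zeroed; 0.7 * mean(|w|) is the
    # common TWN heuristic (an assumption, not XTC's exact choice).
    delta = delta_scale * np.mean(np.abs(w))
    mask = np.abs(w) > delta                 # positions kept non-zero
    # Per-tensor scale alpha: mean magnitude of the surviving weights.
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask         # values in {-alpha, 0, +alpha}

# Example: quantize a random weight matrix to three levels.
w = np.random.randn(4, 4).astype(np.float32)
print(ternarize(w))

Each weight is mapped to one of three values, so it can be stored in roughly two bits instead of 32, which is the source of the extreme size reduction the abstract describes.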