TY - GEN
T1 - XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
T2 - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
AU - Ding, Yifeng
AU - Liu, Jiawei
AU - Wei, Yuxiang
AU - Zhang, Lingming
N1 - We extend our special thanks to Terry Yue Zhuo for his assistance with the scale-up experiments on DeepSeek-Coder-Base 6.7B (§6.1) after our submission. His contributions are good enough to merit authorship; however, due to the policy of ACL 2024, post-submission authorship changes are not permitted. As a result, we have included him in the author list of our arXiv version. We also thank Sea AI Lab and Dr. Qian Liu for their valuable feedback and computing resource assistance. We appreciate all the reviewers for their insightful comments. This work was partially supported by NSF grant CCF-2131943, as well as Kwai Inc.
PY - 2024
N2 - We introduce XFT, a simple yet powerful training scheme that merges upcycled Mixture-of-Experts (MoE) models to unleash the performance limit of instruction-tuned code Large Language Models (LLMs). While vanilla sparse upcycling fails to improve instruction tuning, XFT introduces a shared-expert mechanism with a novel routing-weight normalization strategy into sparse upcycling, which significantly boosts instruction tuning. After fine-tuning the upcycled MoE model, XFT uses a learnable model-merging mechanism to compile the upcycled MoE model back into a dense model, achieving upcycled-MoE-level performance with only dense-model compute. By applying XFT to a 1.3B model, we create a new state-of-the-art tiny code LLM (<3B) with 67.1 and 64.6 pass@1 on HumanEval and HumanEval+, respectively. With the same data and model architecture, XFT improves supervised fine-tuning (SFT) by 13% on HumanEval+, along with consistent improvements of 2% to 13% on MBPP+, MultiPL-E, and DS-1000, demonstrating its generalizability. XFT is fully orthogonal to existing techniques such as Evol-Instruct and OSS-INSTRUCT, opening a new dimension for improving code instruction tuning. Code is available at https://github.com/ise-uiuc/xft.
UR - http://www.scopus.com/inward/record.url?scp=85204464452&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85204464452&partnerID=8YFLogxK
DO - 10.18653/v1/2024.acl-long.699
M3 - Conference contribution
AN - SCOPUS:85204464452
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 12941
EP - 12955
BT - Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A2 - Ku, Lun-Wei
A2 - Martins, Andre F. T.
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
Y2 - 11 August 2024 through 16 August 2024
ER -