TY - CONF
T1 - Enabling Language Models to Implicitly Learn Self-Improvement
AU - Wang, Ziqi
AU - Hou, Le
AU - Lu, Tianjian
AU - Wu, Yuexin
AU - Li, Yunxuan
AU - Yu, Hongkun
AU - Ji, Heng
N1 - We would like to acknowledge our Google colleagues for their invaluable advice and support. In particular, we thank Music Li (Yuezhang Li) for insightful discussions and manual evaluation. We thank Tianqi Liu, Honglong Cai, and Albert Webson for their constructive advice, and Léonard Hussenot and Robert Dadashi for building RLHF infra. Finally, we would like to acknowledge Melvin Johnson, Hongkun Yu, and Denny Zhou for their support throughout the project. We also thank the anonymous reviewers for their suggestions and comments. This research is also based upon work supported by U.S. DARPA ECOLE Program No. HR00112390060 and U.S. DARPA ITM Program No. FA8650-23-C-7316 and KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
PY - 2024
Y1 - 2024
AB - Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. However, the inherent open-ended nature of these tasks implies that there is always room for improvement in the quality of model responses. To address this challenge, various approaches have been proposed to enhance the performance of LLMs. There has been a growing focus on enabling LLMs to self-improve their response quality, thereby reducing the reliance on extensive human annotation efforts for collecting diverse and high-quality training data. Among self-improvement methods, prompting-based approaches have recently been widely explored owing to their effectiveness, efficiency, and convenience. However, these methods usually require explicitly and thoroughly written rubrics as inputs to LLMs, and it is expensive and challenging to manually derive and provide all necessary rubrics for a complex real-world improvement goal (e.g., being more helpful and less harmful). To this end, we propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data. PIT only requires the preference data that are already used to train reward models, without extra human effort. Specifically, we reformulate the training objective of reinforcement learning from human feedback (RLHF): instead of maximizing response quality for a given input, we maximize the quality gap of the response conditioned on a reference response. In this way, PIT is implicitly trained with the improvement goal of better aligning with human preferences. Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods.
UR - http://www.scopus.com/inward/record.url?scp=85200555413&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85200555413&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85200555413
T2 - 12th International Conference on Learning Representations, ICLR 2024
Y2 - 7 May 2024 through 11 May 2024
ER -