TY - JOUR
T1 - Online Iterative Reinforcement Learning from Human Feedback with General Preference Model
AU - Ye, Chenlu
AU - Xiong, Wei
AU - Zhang, Yuheng
AU - Dong, Hanze
AU - Jiang, Nan
AU - Zhang, Tong
N1 - The authors would like to thank Tianqi Liu for insightful discussions on the training of the preference model, and Haoxiang Wang and Zihao Li for valuable discussions on preference dataset selection. We also thank Nevena Lazic and Csaba Szepesvari for pointing out a technical gap in the first version. Wei Xiong and Tong Zhang are partially supported by NSF IIS grant No. 2416897. Nan Jiang acknowledges funding support from NSF IIS-2112471, NSF CAREER IIS-2141781, a Google Scholar Award, and a Sloan Fellowship.
PY - 2024
Y1 - 2024
N2 - We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle. In particular, we do not assume the existence of a reward function and an oracle preference signal drawn from the Bradley-Terry model, as most prior works do. We consider a standard mathematical formulation, the reverse-KL regularized minimax game between two LLMs, for RLHF under a general preference oracle. The learning objective of this formulation is to find a policy that is consistently preferred by the KL-regularized preference oracle over any competing LLM. We show that this framework is strictly more general than the reward-based one, and we propose sample-efficient algorithms both for offline learning from a pre-collected preference dataset and for online learning, where the preference oracle can be queried during training. Empirical studies verify the effectiveness of the proposed framework.
AB - We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle. In particular, we do not assume the existence of a reward function and an oracle preference signal drawn from the Bradley-Terry model, as most prior works do. We consider a standard mathematical formulation, the reverse-KL regularized minimax game between two LLMs, for RLHF under a general preference oracle. The learning objective of this formulation is to find a policy that is consistently preferred by the KL-regularized preference oracle over any competing LLM. We show that this framework is strictly more general than the reward-based one, and we propose sample-efficient algorithms both for offline learning from a pre-collected preference dataset and for online learning, where the preference oracle can be queried during training. Empirical studies verify the effectiveness of the proposed framework.
UR - http://www.scopus.com/inward/record.url?scp=105000506108&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105000506108&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:105000506108
SN - 1049-5258
VL - 37
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024
Y2 - 9 December 2024 through 15 December 2024
ER -