ADVANCING LLM REASONING GENERALISTS WITH PREFERENCE TREES

  • Lifan Yuan
  • , Ganqu Cui
  • , Hanbin Wang
  • , Ning Ding
  • , Xingyao Wang
  • , Boji Shan
  • , Zeyuan Liu
  • , Jia Deng
  • , Huimin Chen
  • , Ruobing Xie
  • , Yankai Lin
  • , Zhenghao Liu
  • , Bowen Zhou
  • , Hao Peng
  • , Zhiyuan Liu
  • , Maosong Sun

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We introduce EURUS, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B, Llama-3-8B, and Mixtral-8x22B, EURUS models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, EURUX-8X22B outperforms GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 test sets covering five tasks. The strong performance of EURUS can be primarily attributed to ULTRAINTERACT, our newly-curated large-scale, high-quality training data dataset specifically designed for complex reasoning tasks. ULTRAINTERACT can be used in both supervised fine-tuning, preference learning, and reward modeling. It pairs each instruction with a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise positive and negative responses to facilitate preference learning. ULTRAINTERACT allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. The hypothesis is that in reasoning tasks, the space of correct answers is much smaller than that of incorrect ones, so it is necessary to explicitly increase the reward of chosen data. Therefore, in addition to increasing the reward margin as many preference learning algorithms do, the absolute values of positive responses' rewards should be positive and may serve as a proxy for performance. Inspired by this, we derive a novel reward modeling objective and empirically that it leads to a stable reward modeling curve and better performance. Together with ULTRAINTERACT, we obtain a strong reward model.

Original languageEnglish (US)
Title of host publication13th International Conference on Learning Representations, ICLR 2025
PublisherInternational Conference on Learning Representations, ICLR
Pages3691-3713
Number of pages23
ISBN (Electronic)9798331320850
StatePublished - 2025
Event13th International Conference on Learning Representations, ICLR 2025 - Singapore, Singapore
Duration: Apr 24 2025Apr 28 2025

Publication series

Name13th International Conference on Learning Representations, ICLR 2025

Conference

Conference13th International Conference on Learning Representations, ICLR 2025
Country/TerritorySingapore
CitySingapore
Period4/24/254/28/25

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'ADVANCING LLM REASONING GENERALISTS WITH PREFERENCE TREES'. Together they form a unique fingerprint.

Cite this