Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment

  • Audrey Huang
  • Adam Block
  • Qinghua Liu
  • Nan Jiang
  • Akshay Krishnamurthy
  • Dylan J. Foster

Research output: Contribution to journal, Conference article, peer-review

Abstract

Recent work on inference-time alignment has established the benefits of increasing inference-time computation in language models, but naively scaling compute through techniques like Best-of-N sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we formalize inference-time alignment as improving a pre-trained policy's responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy's coverage over high-quality responses for performance and compute scaling: (1) We show that Best-of-N alignment with an ideal N can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when N is large, and fails to achieve tight guarantees under more realistic coverage conditions; (2) We introduce InferenceTimePessimism, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing pessimism in the face of uncertainty; we prove that its performance is optimal and scaling-monotonic, i.e., does not degrade as N increases. We complement our theoretical results with experiments that demonstrate the practicality of our algorithm across a variety of tasks and models.
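
As a rough illustration of the Best-of-N procedure the abstract refers to, the sketch below draws N candidate responses from a base policy and returns the one an imperfect reward model scores highest. It is a minimal toy under stated assumptions, not the paper's experimental setup and not InferenceTimePessimism: the integer response space and the names `best_of_n`, `true_reward`, `proxy_reward`, `average_true_reward`, `NUM_RESPONSES`, and `HACK_THRESHOLD` are all hypothetical and chosen only to make reward hacking visible.

```python
import random
from typing import Callable, TypeVar

R = TypeVar("R")


def best_of_n(sample_response: Callable[[], R],
              reward_model: Callable[[R], float],
              n: int) -> R:
    """Draw n candidates from the base policy; return the reward model's favorite."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_response() for _ in range(n)]
    return max(candidates, key=reward_model)


# Hypothetical toy world: responses are the integers 0..999, sampled uniformly.
NUM_RESPONSES = 1000
HACK_THRESHOLD = 990  # assumed: the rare top 1% of responses are actually bad


def true_reward(x: int) -> float:
    """Ground-truth quality; the rare responses above the threshold are worthless."""
    return 0.0 if x >= HACK_THRESHOLD else x / NUM_RESPONSES


def proxy_reward(x: int) -> float:
    """Imperfect reward model: matches true_reward except on the rare bad responses,
    which it overrates."""
    return x / NUM_RESPONSES


def average_true_reward(n: int, trials: int = 200, seed: int = 0) -> float:
    """Average true reward of the Best-of-N choice made with the proxy reward."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        chosen = best_of_n(lambda: rng.randrange(NUM_RESPONSES), proxy_reward, n)
        total += true_reward(chosen)
    return total / trials


if __name__ == "__main__":
    for n in (1, 16, 1000):
        print(n, round(average_true_reward(n), 3))
```

In this toy, a moderate N improves the true reward over a single sample, while a very large N almost always surfaces one of the rare overrated responses and the true reward collapses. This mirrors, only qualitatively, the degradation for large N described in the abstract.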

Original language: English (US)
Pages (from-to): 25075-25126
Number of pages: 52
Journal: Proceedings of Machine Learning Research
Volume: 267
State: Published - 2025
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: Jul 13 2025 to Jul 19 2025

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
