Abstract
Recent work on inference-time alignment has established the benefits of increasing inference-time computation in language models, but naively scaling compute through techniques like Best-of-N sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we formalize inference-time alignment as improving a pre-trained policy's responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy's coverage over high-quality responses for performance and compute scaling: (1) We show that Best-of-N alignment with an ideal N can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when N is large, and fails to achieve tight guarantees under more realistic coverage conditions; (2) We introduce InferenceTimePessimism, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing pessimism in the face of uncertainty; we prove that its performance is optimal and scaling-monotonic, i.e., does not degrade as N increases. We complement our theoretical results with experiments that demonstrate the practicality of our algorithm across a variety of tasks and models.
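As a concrete illustration of the Best-of-N procedure the abstract analyzes, the following is a minimal sketch (not the paper's implementation): sample N responses from the pre-trained policy and return the one the imperfect reward model scores highest. The `generate` and `reward_model` callables are hypothetical stand-ins for the policy and the proxy reward; the reward-hacking failure mode arises because, as N grows, the max increasingly selects responses that exploit the reward model's errors.

```python
import random

def best_of_n(prompt, generate, reward_model, n):
    """Best-of-N alignment sketch: draw n candidate responses from the
    pre-trained policy and return the one with the highest score under
    the (imperfect) reward model. `generate` and `reward_model` are
    hypothetical stand-ins, not taken from the paper's code."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: reward_model(prompt, r))

# Toy usage with stub policy and reward, purely illustrative:
if __name__ == "__main__":
    drafts = ["draft A", "draft B", "longer draft C"]
    generate = lambda prompt: random.choice(drafts)
    reward_model = lambda prompt, r: len(r)  # stand-in proxy reward
    print(best_of_n("Explain Best-of-N sampling.", generate, reward_model, n=8))
```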
| Original language | English (US) |
|---|---|
| Pages (from-to) | 25075-25126 |
| Number of pages | 52 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 267 |
| State | Published - 2025 |
| Event | 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada |
| Duration | Jul 13 2025 → Jul 19 2025 |
ASJC Scopus subject areas
- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence