Off-policy evaluation and learning from logged bandit feedback: Error reduction via surrogate policy

Yuan Xie, Qiang Liu, Yuan Zhou, Boyi Liu, Zhaoran Wang, Jian Peng

Research output: Contribution to conference › Paper

Abstract

When learning from a batch of logged bandit feedback, the discrepancy between the policy to be learned and the off-policy training data poses statistical and computational challenges. Unlike classical supervised learning and online learning settings, in batch contextual bandit learning one only has access to a collection of logged feedback from the actions taken by a historical policy, and aims to learn a policy that takes good actions in possibly unseen contexts. Such a batch learning setting is ubiquitous in online and interactive systems, such as ad platforms and recommendation systems. Existing approaches based on inverse propensity weights, such as Inverse Propensity Scoring (IPS) and the Policy Optimizer for Exponential Models (POEM), enjoy unbiasedness but often suffer from large mean squared error. In this work, we introduce a new approach named Maximum Likelihood Inverse Propensity Scoring (MLIPS) for batch learning from logged bandit feedback. Instead of using the given historical policy as the proposal in the inverse propensity weights, we estimate a maximum likelihood surrogate policy based on the logged action-context pairs and then use this surrogate policy as the proposal. We prove that MLIPS is asymptotically unbiased and, moreover, has a smaller nonasymptotic mean squared error than IPS. Such an error reduction is somewhat surprising, since the estimated surrogate policy is less accurate than the given historical policy. Results on multi-label classification problems and a large-scale ad placement dataset demonstrate the empirical effectiveness of MLIPS. Furthermore, the proposed surrogate policy technique is complementary to existing error reduction techniques and, when combined with them, consistently boosts the performance of several widely used approaches.
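For intuition, the following is a minimal Python sketch of the estimator the abstract describes: vanilla IPS reweights each logged reward by pi_target(a|x) / pi_logging(a|x), while the MLIPS idea replaces the logging propensities in the denominator with those of a surrogate policy fit by maximum likelihood to the logged (context, action) pairs. The softmax (multinomial logit) parameterization, the gradient-ascent MLE, the synthetic data, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax_probs(theta, contexts):
    """Action probabilities of a linear softmax policy: pi(a|x) ~ exp(x @ theta[:, a])."""
    logits = contexts @ theta                        # shape (n, n_actions)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)


def ips_value(target_probs, actions, rewards, proposal_probs):
    """IPS estimate of the target policy's value:
    mean_i [ r_i * pi_target(a_i|x_i) / proposal(a_i|x_i) ]."""
    idx = np.arange(len(actions))
    weights = target_probs[idx, actions] / proposal_probs[idx, actions]
    return float(np.mean(weights * rewards))


def fit_mle_surrogate(contexts, actions, n_actions, lr=0.5, n_iters=2000):
    """Fit a softmax surrogate policy to the logged (context, action) pairs by
    maximum likelihood, using plain gradient ascent on the mean log-likelihood."""
    n, d = contexts.shape
    theta = np.zeros((d, n_actions))
    one_hot = np.eye(n_actions)[actions]
    for _ in range(n_iters):
        probs = softmax_probs(theta, contexts)
        grad = contexts.T @ (one_hot - probs) / n    # gradient of mean log-likelihood
        theta += lr * grad
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, n_actions = 5000, 5, 4
    contexts = rng.normal(size=(n, d))

    # Known logging policy and a target policy to be evaluated, both softmax.
    theta_log = rng.normal(size=(d, n_actions))
    theta_tgt = rng.normal(size=(d, n_actions))
    log_probs = softmax_probs(theta_log, contexts)
    tgt_probs = softmax_probs(theta_tgt, contexts)

    # Log actions from the logging policy and observe bandit feedback (rewards).
    actions = np.array([rng.choice(n_actions, p=p) for p in log_probs])
    reward_probs = 0.5 + 0.4 * np.tanh(contexts[:, 0] * (actions - 1.5))
    rewards = rng.binomial(1, reward_probs)

    # Vanilla IPS: the true logging propensities serve as the proposal.
    v_ips = ips_value(tgt_probs, actions, rewards, log_probs)

    # MLIPS idea: fit a maximum-likelihood surrogate to the logged pairs and
    # use its propensities as the proposal instead of the logging policy's.
    theta_sur = fit_mle_surrogate(contexts, actions, n_actions)
    sur_probs = softmax_probs(theta_sur, contexts)
    v_mlips = ips_value(tgt_probs, actions, rewards, sur_probs)

    print(f"IPS estimate of target policy value:   {v_ips:.4f}")
    print(f"MLIPS estimate of target policy value: {v_mlips:.4f}")
```

Both estimators share the same reweighting form; only the proposal in the denominator differs, which is what the paper's error-reduction claim is about.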

Original language: English (US)
State: Published - Jan 1 2019
Event: 7th International Conference on Learning Representations, ICLR 2019 - New Orleans, United States
Duration: May 6 2019 - May 9 2019

Conference

Conference: 7th International Conference on Learning Representations, ICLR 2019
Country: United States
City: New Orleans
Period: 5/6/19 - 5/9/19

ASJC Scopus subject areas

  • Education
  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

Cite this

Xie, Y., Liu, Q., Zhou, Y., Liu, B., Wang, Z., & Peng, J. (2019). Off-policy evaluation and learning from logged bandit feedback: Error reduction via surrogate policy. Paper presented at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States.

