Abstract
Many bandit problems are characterized by the learner making decisions under constraints. In Linear Contextual Bandits with Knapsacks (LinCBwK), the learner receives, in each time step, a resource consumption vector in addition to a scalar reward, both of which are linear functions of the context corresponding to the chosen arm. For a fixed time horizon T, the goal of the learner is to maximize rewards while ensuring that resource consumption does not exceed a pre-specified budget. We present algorithms and characterize regret for LinCBwK in the smoothed setting, where the base context vectors are assumed to be perturbed by Gaussian noise. We consider both the stochastic and adversarial settings for the base contexts, and our analysis of stochastic LinCBwK can be viewed as a warm-up to the more challenging adversarial LinCBwK. For the stochastic setting, we obtain O(√T) additive regret bounds compared to the best context-dependent fixed policy. The analysis combines ideas for greedy parameter estimation (Kannan et al., 2018; Sivakumar et al., 2020) with the primal-dual paradigm first explored in (Agrawal & Devanur, 2016; 2014a). Our main contribution is an algorithm with an O(log T) competitive ratio relative to the best context-dependent fixed policy for the adversarial setting. The algorithm for the adversarial setting employs ideas from the primal-dual framework (Agrawal & Devanur, 2016; 2014a) and a novel adaptation of the doubling trick (Immorlica et al., 2019).
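To make the problem setup concrete, the following is a minimal sketch of a smoothed LinCBwK environment: base contexts are perturbed by Gaussian noise, and the chosen arm yields a scalar reward and a resource consumption vector, both linear in its context. This is an illustrative assumption-laden sketch, not the paper's code; all names (e.g., `SmoothedLinCBwKEnv`, `theta_star`, `W_star`) and parameter values are hypothetical.

```python
import numpy as np

class SmoothedLinCBwKEnv:
    """Illustrative smoothed LinCBwK environment (hypothetical sketch, not the paper's code).

    K arms, d-dimensional contexts, m resources, budget B per resource.
    Rewards and resource consumptions are linear in the (perturbed) context.
    """

    def __init__(self, K=5, d=4, m=2, B=200.0, sigma=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.K, self.d, self.m, self.sigma = K, d, m, sigma
        self.budget = np.full(m, B)                       # remaining budget per resource
        # Unknown linear parameters: reward vector and consumption matrix (assumed forms).
        self.theta_star = self.rng.uniform(0, 1, size=d)
        self.theta_star /= np.linalg.norm(self.theta_star)
        self.W_star = self.rng.uniform(0, 1, size=(m, d)) / np.sqrt(d)

    def perturb(self, base_contexts):
        """Smoothed contexts: base contexts plus i.i.d. Gaussian perturbation."""
        noise = self.rng.normal(0.0, self.sigma, size=base_contexts.shape)
        return base_contexts + noise

    def step(self, x_chosen):
        """Play the arm with context x_chosen; return (reward, consumption, stop)."""
        reward = float(self.theta_star @ x_chosen)
        consumption = self.W_star @ x_chosen
        self.budget -= consumption
        stop = bool(np.any(self.budget <= 0))             # stop once any resource runs out
        return reward, consumption, stop
```

In such a simulation, a learner would observe the K perturbed contexts each round, select one arm, and halt once any resource budget is exhausted; the algorithms in the paper additionally maintain parameter estimates and dual variables for the budget constraints, which are not shown here.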
| Original language | English (US) |
|---|---|
| Pages (from-to) | 20253-20277 |
| Number of pages | 25 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 162 |
| State | Published - 2022 |
| Event | 39th International Conference on Machine Learning, ICML 2022, Baltimore, United States (Jul 17 – Jul 23, 2022) |
ASJC Scopus subject areas
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability