Learning to control renewal processes with bandit feedback

Semih Cayci, Atilla Eryilmaz, R. Srikant

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We consider a bandit problem with K task types from which the controller activates one task at a time. Each task takes a random and possibly heavy-tailed completion time, and a reward is obtained only after the task is completed. The task types are independent from each other, and have distinct and unknown distributions for completion time and reward. For a given time horizon τ, the goal of the controller is to schedule tasks adaptively so as to maximize the reward collected until τ expires. In addition, we allow the controller to interrupt a task and initiate a new one. In addition to the traditional exploration-exploitation dilemma, this interrupt mechanism introduces a new one: should the controller complete the task and get the reward, or interrupt the task for a possibly shorter and more rewarding alternative? We show that for all heavy-tailed and some light-tailed completion time distributions, this interruption mechanism improves the reward linearly over time. From a learning perspective, the interrupt mechanism necessitates implicitly learning statistics beyond the mean from truncated observations. For this purpose, we propose a robust learning algorithm named UCB-BwI based on the median-of-means estimator for possibly heavy-tailed reward and completion time distributions. We show that, in a Karmed bandit setting with an arbitrary set of L possible interrupt times, UCB-BwI achieves O(K log(τ ) + KL) regret. We also prove that the regret under any admissible policy is ω(K log(τ )), which implies that UCB-BwI is order optimal.

Original languageEnglish (US)
Title of host publicationSIGMETRICS Performance 2019 - Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems
PublisherAssociation for Computing Machinery, Inc
Pages41-42
Number of pages2
ISBN (Electronic)9781450366786
DOIs
StatePublished - Jun 20 2019
Event14th Joint Conference of International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2019 and IFIP Performance Conference 2019, SIGMETRICS/Performance 2019 - Phoenix, United States
Duration: Jun 24 2019Jun 28 2019

Publication series

NameSIGMETRICS Performance 2019 - Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems

Conference

Conference14th Joint Conference of International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2019 and IFIP Performance Conference 2019, SIGMETRICS/Performance 2019
CountryUnited States
CityPhoenix
Period6/24/196/28/19

    Fingerprint

Keywords

  • Heavy-tailed distributions
  • Multi-armed bandits
  • Online learning
  • Online scheduling
  • Renewal theory
  • Stochastic knapsack

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Networks and Communications
  • Computational Theory and Mathematics

Cite this

Cayci, S., Eryilmaz, A., & Srikant, R. (2019). Learning to control renewal processes with bandit feedback. In SIGMETRICS Performance 2019 - Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems (pp. 41-42). (SIGMETRICS Performance 2019 - Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems). Association for Computing Machinery, Inc. https://doi.org/10.1145/3309697.3331515