Learning to Control Renewal Processes with Bandit Feedback

Semih Cayci, Atilla Eryilmaz, R. Srikant

Research output: Contribution to journalArticlepeer-review

Abstract

We consider a bandit problem with K task types from which the controller activates one task at a time. Each task takes a random and possibly heavy-Tailed completion time, and a reward is obtained only after the task is completed. The task types are independent from each other, and have distinct and unknown distributions for completion time and reward. For a given time horizon τ , the goal of the controller is to schedule tasks adaptively so as to maximize the reward collected until τ expires. In addition, we allow the controller to interrupt a task and initiate a new one. In addition to the traditional exploration-exploitation dilemma, this interrupt mechanism introduces a new one: should the controller complete the task and get the reward, or interrupt the task for a possibly shorter and more rewarding alternative? We show that for all heavy-Tailed and some light-Tailed completion time distributions, this interruption mechanism improves the reward linearly over time. From a learning perspective, the interrupt mechanism necessitates implicitly learning statistics beyond the mean from truncated observations. For this purpose, we propose a robust learning algorithm named UCB-BwI based on the median-of-means estimator for possibly heavy-Tailed reward and completion time distributions. We show that, in a Karmed bandit setting with an arbitrary set of L possible interrupt times, UCB-BwI achieves O(K log(τ) + KL) regret. We also prove that the regret under any admissible policy is ?(K log(τ)), which implies that UCB-BwI is order optimal.

Original languageEnglish (US)
Pages (from-to)41-42
Number of pages2
JournalPerformance Evaluation Review
Volume47
Issue number1
DOIs
StatePublished - Dec 17 2019

Keywords

  • heavy-Tailed distributions
  • multi-Armed bandits
  • online learning
  • online scheduling
  • renewal theory
  • stochastic knapsack

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'Learning to Control Renewal Processes with Bandit Feedback'. Together they form a unique fingerprint.

Cite this