TY - GEN
T1 - Learning to control renewal processes with bandit feedback
AU - Cayci, Semih
AU - Eryilmaz, Atilla
AU - Srikant, R.
N1 - Publisher Copyright:
© 2019 Copyright held by the owner/author(s).
PY - 2019/6/20
Y1 - 2019/6/20
N2 - We consider a bandit problem with K task types from which the controller activates one task at a time. Each task takes a random and possibly heavy-tailed completion time, and a reward is obtained only after the task is completed. The task types are independent of each other and have distinct and unknown distributions for completion time and reward. For a given time horizon τ, the goal of the controller is to schedule tasks adaptively so as to maximize the reward collected until τ expires. In addition, we allow the controller to interrupt a task and initiate a new one. Besides the traditional exploration-exploitation dilemma, this interrupt mechanism introduces a new trade-off: should the controller complete the current task and collect its reward, or interrupt it in favor of a possibly shorter and more rewarding alternative? We show that for all heavy-tailed and some light-tailed completion time distributions, this interrupt mechanism improves the reward linearly over time. From a learning perspective, the interrupt mechanism necessitates implicitly learning statistics beyond the mean from truncated observations. For this purpose, we propose a robust learning algorithm named UCB-BwI, based on the median-of-means estimator, for possibly heavy-tailed reward and completion time distributions. We show that, in a K-armed bandit setting with an arbitrary set of L possible interrupt times, UCB-BwI achieves O(K log(τ) + KL) regret. We also prove that the regret under any admissible policy is Ω(K log(τ)), which implies that UCB-BwI is order-optimal.
AB - We consider a bandit problem with K task types from which the controller activates one task at a time. Each task takes a random and possibly heavy-tailed completion time, and a reward is obtained only after the task is completed. The task types are independent of each other and have distinct and unknown distributions for completion time and reward. For a given time horizon τ, the goal of the controller is to schedule tasks adaptively so as to maximize the reward collected until τ expires. In addition, we allow the controller to interrupt a task and initiate a new one. Besides the traditional exploration-exploitation dilemma, this interrupt mechanism introduces a new trade-off: should the controller complete the current task and collect its reward, or interrupt it in favor of a possibly shorter and more rewarding alternative? We show that for all heavy-tailed and some light-tailed completion time distributions, this interrupt mechanism improves the reward linearly over time. From a learning perspective, the interrupt mechanism necessitates implicitly learning statistics beyond the mean from truncated observations. For this purpose, we propose a robust learning algorithm named UCB-BwI, based on the median-of-means estimator, for possibly heavy-tailed reward and completion time distributions. We show that, in a K-armed bandit setting with an arbitrary set of L possible interrupt times, UCB-BwI achieves O(K log(τ) + KL) regret. We also prove that the regret under any admissible policy is Ω(K log(τ)), which implies that UCB-BwI is order-optimal.
KW - Heavy-tailed distributions
KW - Multi-armed bandits
KW - Online learning
KW - Online scheduling
KW - Renewal theory
KW - Stochastic knapsack
UR - http://www.scopus.com/inward/record.url?scp=85069189480&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85069189480&partnerID=8YFLogxK
U2 - 10.1145/3309697.3331515
DO - 10.1145/3309697.3331515
M3 - Conference contribution
AN - SCOPUS:85069189480
T3 - SIGMETRICS Performance 2019 - Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems
SP - 41
EP - 42
BT - SIGMETRICS Performance 2019 - Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems
PB - Association for Computing Machinery, Inc
T2 - 14th Joint Conference of International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2019 and IFIP Performance Conference 2019, SIGMETRICS/Performance 2019
Y2 - 24 June 2019 through 28 June 2019
ER -
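
Illustrative note (not part of the cited record): the abstract mentions that UCB-BwI builds on the median-of-means estimator for heavy-tailed rewards and completion times. The following is a minimal Python sketch of that generic estimator only; the function name, block count, and example distribution are illustrative assumptions, not taken from the paper.

import random
import statistics


def median_of_means(samples, num_blocks):
    """Median-of-means estimate of the mean of `samples`.

    The samples are shuffled, split into `num_blocks` roughly equal blocks,
    and the median of the per-block sample means is returned. Unlike the
    plain empirical mean, this estimate concentrates around the true mean
    even for heavy-tailed distributions with finite variance.
    """
    if not 1 <= num_blocks <= len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    shuffled = list(samples)
    random.shuffle(shuffled)
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]
    block_means = [sum(block) / len(block) for block in blocks]
    return statistics.median(block_means)


if __name__ == "__main__":
    # Heavy-tailed (Pareto) rewards: the empirical mean is easily skewed by
    # outliers, while the median-of-means estimate is more robust.
    rewards = [random.paretovariate(1.5) for _ in range(1000)]
    print("empirical mean :", sum(rewards) / len(rewards))
    print("median of means:", median_of_means(rewards, num_blocks=15))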