TY - GEN
T1 - Online Learning for Markov Decision Processes in Nonstationary Environments: A Dynamic Regret Analysis
AU - Li, Yingying
AU - Li, Na
N1 - Funding Information:
The work was supported by NSF 1608509, NSF CAREER 1553407, AFOSR YIP, and ARPA-E through the NODES program. Y. Li and N. Li are with the School of Engineering and Applied Sciences, Harvard University, 33 Oxford Street, Cambridge, MA 02138, USA (email: [email protected], [email protected]).
Publisher Copyright:
© 2019 American Automatic Control Council.
PY - 2019/7
Y1 - 2019/7
AB - In an online Markov decision process (MDP) with time-varying reward functions, a decision maker has to take an action at each time step before knowing the current reward function. This problem has attracted considerable research interest because of its wide range of applications. The literature usually focuses on static regret analysis, which compares the total reward of the optimal offline stationary policy with that of the online policies. This paper studies a different measure, dynamic regret, defined as the reward difference between the optimal offline (possibly nonstationary) policies and the online policies. This measure better suits the time-varying environment. To obtain a meaningful regret analysis, we introduce a notion of total variation for the time-varying reward functions and bound the dynamic regret in terms of this total variation. We propose an online algorithm, Follow the Weighted Leader (FWL), and prove that its dynamic regret is upper bounded by the total variation. We also prove a lower bound on the dynamic regret of any online algorithm. The lower bound matches the upper bound of FWL, demonstrating the optimality of the algorithm. Finally, we show via simulation that FWL significantly outperforms existing algorithms in the literature.
UR - http://www.scopus.com/inward/record.url?scp=85072279634&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85072279634&partnerID=8YFLogxK
U2 - 10.23919/acc.2019.8815000
DO - 10.23919/acc.2019.8815000
M3 - Conference contribution
AN - SCOPUS:85072279634
T3 - Proceedings of the American Control Conference
SP - 1232
EP - 1237
BT - 2019 American Control Conference, ACC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 American Control Conference, ACC 2019
Y2 - 10 July 2019 through 12 July 2019
ER -