Online Learning for Markov Decision Processes in Nonstationary Environments: A Dynamic Regret Analysis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In an online Markov decision process (MDP) with time-varying reward functions, a decision maker has to take an action at each time step before knowing the current reward function. This problem has received considerable research interest because of its wide range of applications. The literature usually focuses on static regret analysis, comparing the total reward of the optimal offline stationary policy with that of the online policies. This paper studies a different measure, dynamic regret, defined as the reward difference between the optimal offline (possibly nonstationary) policies and the online policies. This measure is better suited to time-varying environments. To obtain a meaningful regret analysis, we introduce a notion of total variation for the time-varying reward functions and bound the dynamic regret in terms of this total variation. We propose an online algorithm, Follow the Weighted Leader (FWL), and prove that its dynamic regret is upper bounded by the total variation. We also prove a lower bound on the dynamic regret of any online algorithm. The lower bound matches the upper bound of FWL, demonstrating the optimality of the algorithm. Finally, we show via simulation that FWL significantly outperforms existing algorithms in the literature.
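As a rough sketch of the quantities described in the abstract (the precise definitions are given in the paper; the notation below is assumed for illustration only), let $r_t$ be the reward function revealed at time $t$, $\pi_t$ the learner's policy, and $\{\pi_t^\ast\}$ an optimal offline (possibly nonstationary) sequence of policies. The dynamic regret and the total variation of the rewards can then be written as

\[
\text{D-Regret}(T) \;=\; \sum_{t=1}^{T} \mathbb{E}\!\left[ r_t\big(s_t^\ast, \pi_t^\ast(s_t^\ast)\big) \right] \;-\; \sum_{t=1}^{T} \mathbb{E}\!\left[ r_t\big(s_t, \pi_t(s_t)\big) \right],
\qquad
V_T \;=\; \sum_{t=1}^{T-1} \max_{s,a} \big| r_{t+1}(s,a) - r_t(s,a) \big|.
\]

Under this reading, the stated results give an upper bound on the dynamic regret of FWL in terms of $V_T$, together with a matching lower bound that holds for any online algorithm.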

Original language: English (US)
Title of host publication: 2019 American Control Conference, ACC 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1232-1237
Number of pages: 6
ISBN (Electronic): 9781538679265
DOIs
State: Published - Jul 2019
Externally published: Yes
Event: 2019 American Control Conference, ACC 2019 - Philadelphia, United States
Duration: Jul 10, 2019 - Jul 12, 2019

Publication series

Name: Proceedings of the American Control Conference
Volume: 2019-July
ISSN (Print): 0743-1619

Conference

Conference: 2019 American Control Conference, ACC 2019
Country/Territory: United States
City: Philadelphia
Period: 7/10/19 - 7/12/19

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
