From minimax value to low-regret algorithms for online Markov decision processes

Peng Guan, Maxim Raginsky, Rebecca Willett

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


The standard Markov Decision Process (MDP) framework assumes a stationary (or at least predictable) environment. Online learning algorithms can deal with nonstationary or unpredictable environments, but they lack a notion of a state that evolves over the course of learning as a function of past actions. In recent years, there has been growing interest in combining these two frameworks by considering an MDP setting in which the cost function is allowed to change arbitrarily after each time step. However, most work in this area has been algorithmic: given a problem, one designs an algorithm from scratch and analyzes its performance on a case-by-case basis. Moreover, the presence of a state and the assumption of an arbitrarily varying environment complicate both the theoretical analysis and the development of computationally efficient methods. This paper builds on recent results of Rakhlin et al. to give a general framework for deriving algorithms in an MDP setting with arbitrarily changing costs. The framework yields a unifying view of existing methods and a general procedure for constructing new ones.
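The online-MDP setting described in the abstract can be illustrated with a toy simulation: a learner acts in a small MDP while the per-round cost function is chosen arbitrarily, and performance is measured by regret against the best stationary policy in hindsight. The state space, transition rule, cost sequence, and placeholder random learner below are illustrative assumptions for the sketch, not the paper's construction.

```python
import itertools
import random

random.seed(0)

# Toy instance (an assumption for illustration): 2 states, 2 actions,
# deterministic transitions, and a cost sequence fixed arbitrarily in advance.
STATES, ACTIONS, T = [0, 1], [0, 1], 200
next_state = lambda s, a: (s + a) % 2  # toy transition kernel

# Adversarial cost sequence: costs[t][s][a] may change arbitrarily with t.
costs = [[[random.random() for _ in ACTIONS] for _ in STATES] for _ in range(T)]

def run(policy, s0=0):
    """Total cost incurred by a policy on the realized cost sequence."""
    s, total = s0, 0.0
    for t in range(T):
        a = policy(s)
        total += costs[t][s][a]
        s = next_state(s, a)
    return total

# Placeholder learner: acts uniformly at random (a real online-MDP algorithm
# would adapt its policy from the observed costs).
learner_cost = run(lambda s: random.choice(ACTIONS))

# Comparator: best deterministic stationary policy in hindsight,
# found by enumerating all |A|^|S| = 4 state-to-action maps.
best = min(run(lambda s, m=m: m[s]) for m in itertools.product(ACTIONS, repeat=2))
regret = learner_cost - best
```

The low-regret algorithms the paper studies are those whose `regret` grows sublinearly in the horizon `T`, uniformly over how the cost sequence is chosen.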

Original language: English (US)
Title of host publication: 2014 American Control Conference, ACC 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 6
ISBN (Print): 9781479932726
State: Published - 2014
Event: 2014 American Control Conference, ACC 2014 - Portland, OR, United States
Duration: Jun 4 2014 – Jun 6 2014

Publication series

Name: Proceedings of the American Control Conference
ISSN (Print): 0743-1619


Other: 2014 American Control Conference, ACC 2014
Country/Territory: United States
City: Portland, OR


Keywords

  • Machine learning
  • Markov processes

ASJC Scopus subject areas

  • Electrical and Electronic Engineering


