TY - JOUR
T1 - Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games
AU - Mao, Weichao
AU - Başar, Tamer
N1 - Funding Information:
Research leading to this work was supported in part by ONR MURI Grant N00014-16-1-2710 and in part by AFOSR Grant FA9550-19-1-0353. Data availability statement: Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study. We thank Kaiqing Zhang and Lin F. Yang for helpful discussions.
Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2023/3
Y1 - 2023/3
AB - This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of calculating a Nash equilibrium (NE), we instead aim at finding a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents’ strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an ϵ-approximate CCE in at most Õ(H^6 SA/ϵ^2) episodes, where S is the number of states, A is the size of the largest individual action space, and H is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel investigation of an anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which would be of independent interest. One key feature of our algorithm is that it is decentralized, in the sense that each agent has access to only its local information, and is completely oblivious to the presence of others. This way, our algorithm can readily scale up to an arbitrary number of agents, without suffering from the exponential dependence on the number of agents.
KW - Coarse correlated equilibrium
KW - Markov game
KW - Reinforcement learning
KW - Sample complexity
UR - http://www.scopus.com/inward/record.url?scp=85122373440&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85122373440&partnerID=8YFLogxK
U2 - 10.1007/s13235-021-00420-0
DO - 10.1007/s13235-021-00420-0
M3 - Article
AN - SCOPUS:85122373440
SN - 2153-0785
VL - 13
SP - 165
EP - 186
JO - Dynamic Games and Applications
JF - Dynamic Games and Applications
IS - 1
ER -
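
Note: The sketch below is illustrative only and is not part of the published record or the authors' algorithm. It shows one step of online mirror descent over a probability simplex with an entropy regularizer (an exponential-weights update) and a step-dependent learning rate, i.e. the kind of OMD policy-update subroutine the abstract mentions. The learning-rate schedule, the helper name omd_step, and the scale parameter H are assumptions made purely for this example.

    # Illustrative sketch (assumed details, not the paper's stabilized OMD):
    # one online-mirror-descent step on the action simplex with an entropy
    # regularizer, using a step-dependent (dynamic) learning rate.
    import numpy as np

    def omd_step(policy, loss, t, H=1.0):
        """Update a probability vector over actions given a loss vector.

        policy : current distribution over A actions
        loss   : observed/estimated loss for each action at update t
        t      : update counter, drives the dynamic learning rate
        H      : loss scale (e.g. episode horizon); assumed for the example
        """
        eta_t = np.sqrt(np.log(len(policy)) / (H * t))   # assumed schedule
        weights = policy * np.exp(-eta_t * loss)         # exponentiated-gradient step
        return weights / weights.sum()                   # renormalize onto the simplex

    # Usage: three actions, uniform start, losses favoring action 0
    p = np.ones(3) / 3
    for t in range(1, 6):
        p = omd_step(p, np.array([0.1, 0.5, 0.9]), t)
    print(p)  # probability mass shifts toward the low-loss action

In a decentralized setting of the kind the abstract describes, each agent would run such an update on its own local policy using only its own observed losses, without any knowledge of the other agents.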