TY - JOUR
T1 - Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity
AU - Zhang, Kaiqing
AU - Kakade, Sham M.
AU - Başar, Tamer
AU - Yang, Lin F.
N1 - Publisher Copyright:
© 2020 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2020
Y1 - 2020
N2 - Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though model-based MARL is intuitive and widely used, its sample complexity has received comparatively little investigation. In this paper, we aim to address this fundamental open question about the sample complexity of model-based MARL. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model of the state transitions. We show that model-based MARL achieves a sample complexity of Õ(|S||A||B|(1-γ)^{-3}ε^{-2}) for finding the Nash equilibrium (NE) value up to some ε error, as well as ε-NE policies, where γ is the discount factor, and S, A, and B denote the state space and the action spaces of the two agents, respectively. We also show that this method is near-minimax optimal with a tight dependence on 1-γ and |S|, by providing a lower bound of Ω(|S|(|A|+|B|)(1-γ)^{-3}ε^{-2}). Our results justify the efficiency of this simple model-based approach in the multi-agent RL setting.
AB - Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though model-based MARL is intuitive and widely used, its sample complexity has received comparatively little investigation. In this paper, we aim to address this fundamental open question about the sample complexity of model-based MARL. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model of the state transitions. We show that model-based MARL achieves a sample complexity of Õ(|S||A||B|(1-γ)^{-3}ε^{-2}) for finding the Nash equilibrium (NE) value up to some ε error, as well as ε-NE policies, where γ is the discount factor, and S, A, and B denote the state space and the action spaces of the two agents, respectively. We also show that this method is near-minimax optimal with a tight dependence on 1-γ and |S|, by providing a lower bound of Ω(|S|(|A|+|B|)(1-γ)^{-3}ε^{-2}). Our results justify the efficiency of this simple model-based approach in the multi-agent RL setting.
UR - http://www.scopus.com/inward/record.url?scp=85101217043&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85101217043&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85101217043
SN - 1049-5258
VL - 2020-December
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 34th Conference on Neural Information Processing Systems, NeurIPS 2020
Y2 - 6 December 2020 through 12 December 2020
ER -