TY - JOUR
T1 - One Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement Learning
T2 - 26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023
AU - Cisneros-Velarde, Pedro
AU - Lyu, Boxiang
AU - Koyejo, Sanmi
AU - Kolar, Mladen
N1 - We are grateful to all the reviewers and the meta-reviewer for their time and for their comments, which helped improve our paper. This work is partially supported by NSF III 2046795, IIS 1909577, CCF 1934986, NIH 1R01MH116226-01A, NIFA award 2020-67021-32799, the Alfred P. Sloan Foundation, and Google Inc. This work is also partially supported by the William S. Fishman Faculty Research Fund at the University of Chicago Booth School of Business.
PY - 2023
Y1 - 2023
AB - Although parallelism has been extensively used in reinforcement learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature, which focuses on approaches that encourage agents to explore a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterparts. Furthermore, we demonstrate that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.
UR - http://www.scopus.com/inward/record.url?scp=85165200970&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85165200970&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85165200970
SN - 2640-3498
VL - 206
SP - 1965
EP - 2001
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 25 April 2023 through 27 April 2023
ER -