TY - GEN
T1 - Monte-Carlo tree search for policy optimization
AU - Ma, Xiaobai
AU - Driggs-Campbell, Katherine
AU - Zhang, Zongzhang
AU - Kochenderfer, Mykel J.
N1 - Funding Information:
We thank anonymous reviewers for their helpful feedback and suggestions. This work is sponsored through the Stanford Center for AI Safety. Zongzhang Zhang is in part supported by the National Natural Science Foundation of China under Grant No. 61876119, and the Natural Science Foundation of Jiangsu under Grant No. BK20181432, and the China Scholarship Council.
Publisher Copyright:
© 2019 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2019
Y1 - 2019
N2 - Gradient-based methods are often used for policy optimization in deep reinforcement learning, despite being vulnerable to local optima and saddle points. Although gradient-free methods (e.g., genetic algorithms or evolution strategies) help mitigate these issues, poor initialization and local optima are still concerns in highly nonconvex spaces. This paper presents a method for policy optimization based on Monte-Carlo tree search and gradient-free optimization. Our method, called Monte-Carlo tree search for policy optimization (MCTSPO), provides a better exploration-exploitation trade-off through the use of the upper confidence bound heuristic. We demonstrate improved performance on reinforcement learning tasks with deceptive or sparse reward functions compared to popular gradient-based and deep genetic algorithm baselines.
AB - Gradient-based methods are often used for policy optimization in deep reinforcement learning, despite being vulnerable to local optima and saddle points. Although gradient-free methods (e.g., genetic algorithms or evolution strategies) help mitigate these issues, poor initialization and local optima are still concerns in highly nonconvex spaces. This paper presents a method for policy optimization based on Monte-Carlo tree search and gradient-free optimization. Our method, called Monte-Carlo tree search for policy optimization (MCTSPO), provides a better exploration-exploitation trade-off through the use of the upper confidence bound heuristic. We demonstrate improved performance on reinforcement learning tasks with deceptive or sparse reward functions compared to popular gradient-based and deep genetic algorithm baselines.
UR - http://www.scopus.com/inward/record.url?scp=85074948194&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074948194&partnerID=8YFLogxK
U2 - 10.24963/ijcai.2019/432
DO - 10.24963/ijcai.2019/432
M3 - Conference contribution
AN - SCOPUS:85074948194
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 3116
EP - 3122
BT - Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019
A2 - Kraus, Sarit
PB - International Joint Conferences on Artificial Intelligence
T2 - 28th International Joint Conference on Artificial Intelligence, IJCAI 2019
Y2 - 10 August 2019 through 16 August 2019
ER -