TY - GEN
T1 - A Finite Sample Analysis of the Actor-Critic Algorithm
AU - Yang, Zhuoran
AU - Zhang, Kaiqing
AU - Hong, Mingyi
AU - Basar, Tamer
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - We study the finite-sample performance of the batch actor-critic algorithm for reinforcement learning with nonlinear function approximations. Specifically, in the critic step, we estimate the action-value function corresponding to the policy of the actor within some parametrized function class, while in the actor step, the policy is updated using the policy gradient estimated based on the critic, so as to maximize the objective function defined as the expected value of discounted cumulative rewards. Under this setting, for the parameter sequence generated by the actor steps, we show that the gradient norm of the objective function at any limit point is close to zero up to some fundamental error. In particular, we show that this error corresponds to the statistical rate of policy evaluation with nonlinear function approximations. For the special class of linear functions, as the number of samples goes to infinity, our result recovers the classical convergence results for the online actor-critic algorithm, which is based on the asymptotic behavior of two-time-scale stochastic approximation.
AB - We study the finite-sample performance of the batch actor-critic algorithm for reinforcement learning with nonlinear function approximations. Specifically, in the critic step, we estimate the action-value function corresponding to the policy of the actor within some parametrized function class, while in the actor step, the policy is updated using the policy gradient estimated based on the critic, so as to maximize the objective function defined as the expected value of discounted cumulative rewards. Under this setting, for the parameter sequence generated by the actor steps, we show that the gradient norm of the objective function at any limit point is close to zero up to some fundamental error. In particular, we show that this error corresponds to the statistical rate of policy evaluation with nonlinear function approximations. For the special class of linear functions, as the number of samples goes to infinity, our result recovers the classical convergence results for the online actor-critic algorithm, which is based on the asymptotic behavior of two-time-scale stochastic approximation.
UR - http://www.scopus.com/inward/record.url?scp=85062191910&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062191910&partnerID=8YFLogxK
U2 - 10.1109/CDC.2018.8619440
DO - 10.1109/CDC.2018.8619440
M3 - Conference contribution
AN - SCOPUS:85062191910
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 2759
EP - 2764
BT - 2018 IEEE Conference on Decision and Control, CDC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 57th IEEE Conference on Decision and Control, CDC 2018
Y2 - 17 December 2018 through 19 December 2018
ER -