Learning self-imitating diverse policies

Tanmay Gangwani, Qiang Liu, Jian Peng

Research output: Contribution to conference › Paper

Abstract

The success of popular algorithms for deep reinforcement learning, such as policy gradients and Q-learning, relies heavily on the availability of an informative reward signal at each timestep of the sequential decision-making process. When rewards are only sparsely available during an episode, or reward feedback is provided only after episode termination, these algorithms perform sub-optimally due to the difficulty of credit assignment. Alternatively, trajectory-based policy optimization methods, such as the cross-entropy method and evolution strategies, do not require per-timestep rewards, but have been found to suffer from high sample complexity by completely forgoing the temporal nature of the problem. Improving the efficiency of RL algorithms in real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. We view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. We show that, with the Jensen-Shannon divergence, this divergence minimization problem can be reduced to a policy-gradient algorithm with shaped rewards learned from experience replays. Experimental results indicate that our algorithm performs comparably to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. We then discuss limitations of self-imitation learning, and propose to solve them by using Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate its effectiveness on a challenging variant of continuous-control MuJoCo locomotion tasks.
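The divergence-minimization view described in the abstract can be summarized with a short sketch. The notation below (visitation distributions rho, discriminator D_w, shaped reward r~) is illustrative rather than taken verbatim from the paper; it follows the standard GAIL-style variational treatment of the Jensen-Shannon divergence that the abstract alludes to.

% A minimal sketch (illustrative, not verbatim from the paper) of self-imitation
% as divergence minimization between state-action visitation distributions.
% rho_pi(s,a): visitation distribution of the current policy pi_theta
% rho_E(s,a): visitation induced by high-return trajectories stored in the replay buffer
\min_{\theta} \; D_{\mathrm{JS}}\big(\rho_{\pi_\theta}(s,a) \,\|\, \rho_E(s,a)\big)
% D_JS admits the standard variational (GAIL-style) bound with a discriminator
% D_w(s,a) trained to separate replay samples from current-policy samples:
\max_{w} \; \mathbb{E}_{\rho_E}\big[\log D_w(s,a)\big] + \mathbb{E}_{\rho_{\pi_\theta}}\big[\log\big(1 - D_w(s,a)\big)\big]
% Differentiating the bound with respect to theta yields an ordinary policy gradient
% in which the per-timestep reward is replaced by a shaped reward, for example
\tilde{r}(s,a) = -\log\big(1 - D_w(s,a)\big)
% The exact shaping function, and how it is mixed with any available environment
% reward, are assumptions of this sketch rather than claims about the paper.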

Original language: English (US)
State: Published - Jan 1 2019
Event: 7th International Conference on Learning Representations, ICLR 2019 - New Orleans, United States
Duration: May 6 2019 – May 9 2019

Conference

Conference: 7th International Conference on Learning Representations, ICLR 2019
Country: United States
City: New Orleans
Period: 5/6/19 – 5/9/19


ASJC Scopus subject areas

  • Education
  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

Cite this

Gangwani, T., Liu, Q., & Peng, J. (2019). Learning self-imitating diverse policies. Paper presented at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States.
