Abstract
This paper extends off-policy reinforcement learning to the multi-agent case, in which a set of networked agents, communicating with their neighbors over a time-varying graph, collaboratively evaluates and improves a target policy while following a distinct behavior policy. To this end, the paper develops a multi-agent version of emphatic temporal difference learning for off-policy policy evaluation, and proves convergence under linear function approximation. The paper then leverages this result, in conjunction with a novel multi-agent off-policy policy gradient theorem and recent work on both multi-agent on-policy and single-agent off-policy actor-critic methods, to develop and give convergence guarantees for a new multi-agent off-policy actor-critic algorithm. An empirical validation of these theoretical results is given.
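As background for the policy-evaluation step the abstract refers to, the sketch below shows a generic single-agent emphatic TD(λ) update with linear function approximation, in the spirit of Sutton, Mahmood and White's formulation. It is not the paper's multi-agent, consensus-based algorithm; the function name, step size, discount, trace decay, and interest values are illustrative assumptions.

```python
import numpy as np

def run_emphatic_td(features, rewards, rhos, gamma=0.99, lam=0.9,
                    alpha=0.01, interest=1.0):
    """Illustrative single-agent emphatic TD(lambda) for off-policy evaluation.

    features : (T+1, d) array of state feature vectors x_0 .. x_T
    rewards  : (T,) array, rewards[t] received after step t
    rhos     : (T,) array of importance ratios pi(a_t|s_t) / mu(a_t|s_t)
    """
    d = features.shape[1]
    w = np.zeros(d)      # linear value-function weights
    e = np.zeros(d)      # emphatic eligibility trace
    F = 0.0              # follow-on trace
    rho_prev = 1.0       # no importance correction before the first step
    for t in range(len(rewards)):
        x, x_next = features[t], features[t + 1]
        F = gamma * rho_prev * F + interest           # follow-on trace
        M = lam * interest + (1.0 - lam) * F          # emphasis weight
        e = rhos[t] * (gamma * lam * e + M * x)       # eligibility trace update
        delta = rewards[t] + gamma * w @ x_next - w @ x   # TD error
        w = w + alpha * delta * e                     # weight update
        rho_prev = rhos[t]
    return w
```

The emphasis weight M re-weights each state by how much the target policy would visit it, which is what lets linear TD remain stable under off-policy sampling; the paper's contribution, as stated in the abstract, is extending this idea to networked agents that reach agreement through neighbor-to-neighbor communication.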
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 1549-1554 |
| Number of pages | 6 |
| Journal | IFAC-PapersOnLine |
| Volume | 53 |
| DOIs | |
| State | Published - 2020 |
| Event | 21st IFAC World Congress 2020, Berlin, Germany; Duration: Jul 12 2020 → Jul 17 2020 |
Keywords
- Adaptive control of multi-agent systems
- Consensus and reinforcement learning control
ASJC Scopus subject areas
- Control and Systems Engineering