Abstract
This article considers multiagent games in which the opponents can change policies and their policy sets are only partially known. Our goal is to generate an effective policy that lets our agent obtain a higher reward while guaranteeing bounded regret. For such games against nonstationary opponents with partially known policies, we propose the Exp3.P-based autonomous decision (EAD) algorithm, which consists of three steps. First, we learn an embedding of the opponent’s policy via a conditional encoder–decoder and employ conditional reinforcement learning to generate the targeted policy. Second, we estimate the opponent’s policy through online Bayesian belief updates. Finally, we select between the adversarial and targeted policies via a multiarmed bandit algorithm. We analyze the EAD algorithm theoretically: we derive a lower bound on the expected reward when using the targeted policy and prove that the EAD algorithm has bounded regret. Experimental results on Kuhn poker and Grid-world Predator–Prey show the effectiveness of the proposed EAD algorithm.
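The final step selects between the adversarial and targeted policies with a multiarmed bandit. As a point of reference, the standard Exp3.P algorithm that EAD builds on can be sketched as follows; this is a minimal illustration of textbook Exp3.P, not the paper's implementation, and the hyperparameters `gamma`, `alpha`, and horizon `T` are illustrative choices, not the authors' settings.

```python
import math
import random

class Exp3P:
    """Minimal sketch of the Exp3.P bandit algorithm.

    In the EAD setting each arm would correspond to a candidate
    policy (e.g., adversarial vs. targeted); the mapping here is
    an assumption for illustration only.
    """

    def __init__(self, K, T, gamma=0.1, alpha=0.5):
        self.K, self.T, self.gamma, self.alpha = K, T, gamma, alpha
        # Exp3.P starts all weights at exp((alpha*gamma/3)*sqrt(T/K)).
        w0 = math.exp((alpha * gamma / 3.0) * math.sqrt(T / K))
        self.w = [w0] * K

    def probs(self):
        total = sum(self.w)
        # Mix exponential weights with uniform exploration gamma/K,
        # so every arm keeps probability at least gamma/K.
        return [(1.0 - self.gamma) * wi / total + self.gamma / self.K
                for wi in self.w]

    def draw(self):
        # Sample an arm index according to the current distribution.
        return random.choices(range(self.K), weights=self.probs())[0]

    def update(self, arm, reward):
        """Update weights after observing reward in [0, 1] for `arm`."""
        p = self.probs()
        for j in range(self.K):
            # Importance-weighted reward estimate for the played arm,
            # plus Exp3.P's confidence bonus alpha / (p_j * sqrt(K*T)).
            xhat = reward / p[j] if j == arm else 0.0
            bonus = self.alpha / (p[j] * math.sqrt(self.K * self.T))
            self.w[j] *= math.exp((self.gamma / (3.0 * self.K)) * (xhat + bonus))
```

A round then consists of `draw()`, executing the selected policy against the opponent, and `update(arm, reward)` with the observed payoff rescaled to [0, 1]; the confidence bonus in the weight update is what distinguishes Exp3.P from plain Exp3 and underpins its high-probability regret bound.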
| Original language | English (US) |
|---|---|
| Pages (from-to) | 975-988 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Games |
| Volume | 17 |
| Issue number | 4 |
| DOIs | |
| State | Published - 2025 |
Keywords
- Exp3.P-based autonomous decision (EAD)
- multiarmed bandits
- nonstationary opponents with partially known policies
- opponent modeling
ASJC Scopus subject areas
- Software
- Control and Systems Engineering
- Electrical and Electronic Engineering
- Artificial Intelligence