Abstract
Embodied task completion is a challenge in which an agent in a simulated environment must predict environment actions to complete tasks based on natural language instructions and egocentric visual observations. We propose a variant of this problem in which the agent predicts actions at a higher level of abstraction, called a plan, which makes agent actions more interpretable and can be obtained by appropriately prompting large language models. We show that multimodal transformer models can outperform language-only models on this problem but fall significantly short of oracle plans. Since collecting human-human dialogues for embodied environments is expensive and time-consuming, we propose a method to synthetically generate such dialogues, which we then use as training data for plan prediction. We demonstrate that multimodal transformer models attain strong zero-shot performance from our synthetic data, outperforming language-only models trained on human-human data.
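To make the plan-prediction setup concrete, the sketch below illustrates one way an instruction could be turned into a high-level plan by prompting an LLM, as the abstract alludes to. This is a minimal sketch, not the authors' implementation: the prompt wording, the function names (`predict_plan`, `call_llm`), and the canned plan are all assumptions for illustration.

```python
# Hypothetical sketch of plan elicitation via LLM prompting.
# None of these names come from the paper; call_llm is a stand-in
# for whichever LLM endpoint is actually used.
from typing import List

PLAN_PROMPT = (
    "You are a household robot. Break the instruction below into a short,\n"
    "ordered list of high-level plan steps (one per line, verb + object).\n"
    "Instruction: {instruction}\n"
    "Plan:"
)

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call; returns a canned reply here."""
    return "1. go to the kitchen counter\n2. pick up the mug\n3. place the mug in the sink"

def predict_plan(instruction: str) -> List[str]:
    """Prompt the (placeholder) LLM and parse its reply into plan steps."""
    reply = call_llm(PLAN_PROMPT.format(instruction=instruction))
    steps = []
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop a leading "1.", "2)", etc. if the model numbers its steps.
        steps.append(line.lstrip("0123456789.) ").strip())
    return steps

if __name__ == "__main__":
    for step in predict_plan("Put the dirty mug in the sink."):
        print(step)
```

A plan expressed as short natural-language steps like these is what makes the agent's behavior easier to inspect than low-level environment actions.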
| Field | Value |
|---|---|
| Original language | English (US) |
| Title of host publication | Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing |
| Editors | Houda Bouamor, Juan Pino, Kalika Bali |
| Place of Publication | Singapore |
| Publisher | Association for Computational Linguistics |
| Pages | 6114-6131 |
| Number of pages | 18 |
| DOIs | |
| State | Published - Dec 2023 |