Multimodal Embodied Plan Prediction Augmented with Synthetic Embodied Dialogue

Aishwarya Padmakumar, Mert Inan, Spandana Gella, Patrick Lange, Dilek Hakkani-Tur

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Embodied task completion is a challenge in which an agent in a simulated environment must predict environment actions to complete tasks based on natural language instructions and egocentric visual observations. We propose a variant of this problem where the agent predicts actions at a higher level of abstraction, called a plan, which helps make agent actions more interpretable and can be obtained from appropriate prompting of large language models. We show that multimodal transformer models can outperform language-only models for this problem but fall significantly short of oracle plans. Since collecting human-human dialogues for embodied environments is expensive and time-consuming, we propose a method to synthetically generate such dialogues, which we then use as training data for plan prediction. We demonstrate that multimodal transformer models can attain strong zero-shot performance from our synthetic data, outperforming language-only models trained on human-human data.
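
The abstract notes that high-level plans can be obtained by prompting large language models. The sketch below is illustrative only, not the authors' implementation: it shows one plausible way a plan could be elicited from a dialogue history and parsed into discrete steps. The `llm` callable, the prompt wording, and the numbered-list plan format are all assumptions introduced here for the example.

```python
from typing import Callable, List


def predict_plan(dialogue: List[str], llm: Callable[[str], str]) -> List[str]:
    """Prompt a language model to turn an embodied dialogue into a high-level plan.

    Illustrative sketch only: the prompt text and the expected "1. step"
    output format are assumptions, not the prompts used in the paper.
    """
    history = "\n".join(dialogue)
    prompt = (
        "You are an embodied agent in a household environment. Given the "
        "dialogue below, list the high-level plan steps (one per line, "
        "numbered) needed to complete the task.\n\n"
        f"Dialogue:\n{history}\n\nPlan:"
    )
    raw = llm(prompt)

    # Parse numbered lines such as "1. Find the mug" into plan steps.
    steps = []
    for line in raw.splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            steps.append(line.split(".", 1)[-1].strip())
    return steps


if __name__ == "__main__":
    # Stub model so the sketch runs without any API; swap in a real LLM call.
    def fake_llm(prompt: str) -> str:
        return (
            "1. Find a clean mug\n"
            "2. Rinse the mug in the sink\n"
            "3. Place the mug in the coffee maker"
        )

    dialogue = [
        "Commander: Please make me a cup of coffee.",
        "Follower: Where can I find a clean mug?",
        "Commander: There is one on the dining table.",
    ]
    print(predict_plan(dialogue, fake_llm))
```

The same pattern could, in principle, be run in reverse to synthesize dialogues from plans for training data, as the abstract describes, though the paper's actual generation method may differ.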
Original language: English (US)
Title of host publication: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Place of Publication: Singapore
Publisher: Association for Computational Linguistics
Pages: 6114-6131
Number of pages: 18
State: Published - Dec 2023
