TY - JOUR
T1 - A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots
AU - Chang, Peixin
AU - Liu, Shuijing
AU - Ji, Tianchen
AU - Chakraborty, Neeloy
AU - Hong, Kaiwen
AU - Driggs-Campbell, Katherine
N1 - This work is supported by AIFARMS through the Agriculture and Food Research Initiative (AFRI) grant no. 2020-67021-32799/project accession no. 1024178 from the USDA National Institute of Food and Agriculture. We thank Yunzhu Li and Karen Livescu for insightful discussions and all reviewers for their feedback.
PY - 2023
Y1 - 2023
N2 - A command-following robot that serves people in everyday life must continually improve itself in deployment domains with minimal help from its end users, rather than from engineers. Previous methods are either difficult to improve continuously after deployment or require a large number of new labels during fine-tuning. Motivated by (self-)supervised contrastive learning, we propose a novel representation that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated and real-world experiments, we show that our system can continually self-improve in previously unseen scenarios with fewer newly labeled data, while still achieving better performance than previous methods.
KW - Command Following
KW - Human-in-the-Loop
KW - Multimodal Representation
KW - Reinforcement Learning
UR - http://www.scopus.com/inward/record.url?scp=85184352585&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85184352585&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85184352585
SN - 2640-3498
VL - 229
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 7th Conference on Robot Learning, CoRL 2023
Y2 - 6 November 2023 through 9 November 2023
ER -