TY - GEN
T1 - Learning Visual-Audio Representations for Voice-Controlled Robots
AU - Chang, Peixin
AU - Liu, Shuijing
AU - McPherson, D. Livingston
AU - Driggs-Campbell, Katherine
N1 - P. Chang, S. Liu, D. L. McPherson, and K. Driggs-Campbell are with the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. Emails: {pchang17,sliu105,dlivm,krdc}@illinois.edu. This work is supported by the Agriculture and Food Research Initiative (AFRI) grant no. 2020-67021-32799/project accession no. 1024178 from the USDA National Institute of Food and Agriculture.
PY - 2023
Y1 - 2023
N2 - Building on recent advances in representation learning, we propose a novel pipeline for task-oriented voice-controlled robots with raw sensor inputs. Previous methods rely on a large number of labels and task-specific reward functions. Such approaches can hardly be improved after deployment and generalize poorly across robotic platforms and tasks. To address these problems, our pipeline first learns a visual-audio representation (VAR) that associates images and sound commands. The robot then learns to fulfill the sound command via reinforcement learning, using the reward generated by the VAR. We demonstrate our approach with various sound types, robots, and tasks. We show that our method outperforms previous work with far fewer labels. We also show, in both simulated and real-world experiments, that the system can self-improve in previously unseen scenarios given a reasonable amount of newly labeled data.
UR - https://www.scopus.com/pages/publications/85165941567
UR - https://www.scopus.com/pages/publications/85165941567#tab=citedBy
U2 - 10.1109/ICRA48891.2023.10161461
DO - 10.1109/ICRA48891.2023.10161461
M3 - Conference contribution
AN - SCOPUS:85165941567
T3 - Proceedings - IEEE International Conference on Robotics and Automation
SP - 9508
EP - 9514
BT - Proceedings - ICRA 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Robotics and Automation, ICRA 2023
Y2 - 29 May 2023 through 2 June 2023
ER -