Robot sound interpretation: Combining sight and sound in learning-based control

Peixin Chang, Shuijing Liu, Haonan Chen, Katherine Driggs-Campbell

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We explore the interpretation of sound for robot decision making, inspired by human speech comprehension. While previous methods separate the sound processing unit from the robot controller, we propose an end-to-end deep neural network that directly interprets sound commands for visual-based decision making. The network is trained using reinforcement learning with auxiliary losses on the sight and sound networks. We demonstrate our approach on two robots, a TurtleBot3 and a Kuka-IIWA arm, which hear a command word, identify the associated target object, and perform precise control to reach the target. For both robots, we show empirically that our network generalizes across sound types and robotic tasks. We successfully transfer the policy learned in the simulator to a real-world TurtleBot3.

Original language: English (US)
Title of host publication: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5580-5587
Number of pages: 8
ISBN (Electronic): 9781728162126
State: Published - Oct 24 2020
Event: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020 - Las Vegas, United States
Duration: Oct 24 2020 → Jan 24 2021

Publication series

Name: IEEE International Conference on Intelligent Robots and Systems
ISSN (Print): 2153-0858
ISSN (Electronic): 2153-0866

Conference

Conference: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020
Country: United States
City: Las Vegas
Period: 10/24/20 → 1/24/21

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Software
  • Computer Vision and Pattern Recognition
  • Computer Science Applications
