A Translation Framework for Visually Grounded Spoken Unit Discovery

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Multimodal acoustic unit discovery (MAUD) is a key task in self-supervised spoken language learning and low-resource speech recognition. In this paper, we propose two models for MAUD inspired by machine translation, treating speech and images as source and target languages. Our word discovery model outperforms the previous state-of-the-art approach by 5.3% alignment F1 on the SpeechCOCO dataset, and our phoneme discovery model outperforms the previous state of the art by 7% normalized mutual information on TIMIT.
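To make the translation framing concrete, below is a minimal sketch of one way speech and images could be wired up as source and target languages. This is not the paper's implementation: the class name SpeechImageTranslator, the module choices, and all dimensions (hidden size, number of image regions) are illustrative assumptions. The idea shown is that image region features attend over encoded speech frames, and the cross-attention weights act as soft alignments from which unit boundaries could be read off.

```python
# Minimal sketch (not the authors' code) of a translation-style model:
# speech frames are the "source" sequence, image region features the
# "target"; cross-attention weights serve as soft alignments for unit
# discovery. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class SpeechImageTranslator(nn.Module):
    def __init__(self, speech_dim=80, image_dim=2048, hidden=256):
        super().__init__()
        # Encode the speech "source language" with a BiLSTM.
        self.speech_enc = nn.LSTM(speech_dim, hidden // 2,
                                  batch_first=True, bidirectional=True)
        # Project image region features into the shared space.
        self.image_proj = nn.Linear(image_dim, hidden)
        # Cross-attention: image regions (queries) attend over speech frames.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4,
                                          batch_first=True)

    def forward(self, speech, image_regions):
        # speech: (B, T, speech_dim); image_regions: (B, R, image_dim)
        src, _ = self.speech_enc(speech)       # (B, T, hidden)
        tgt = self.image_proj(image_regions)   # (B, R, hidden)
        ctx, align = self.attn(tgt, src, src)  # align: (B, R, T)
        # `align` gives soft region-to-frame alignments; segmenting
        # contiguous high-attention spans yields candidate word/phoneme
        # units. `ctx` can feed a matching or translation loss.
        return ctx, align


model = SpeechImageTranslator()
speech = torch.randn(2, 100, 80)     # e.g. 100 log-mel frames per utterance
regions = torch.randn(2, 36, 2048)   # e.g. 36 detected region features
ctx, align = model(speech, regions)
print(align.shape)                   # torch.Size([2, 36, 100])
```

In this sketch the attention map plays the role of a translation alignment matrix; the paper's actual architectures, losses, and segmentation procedures should be taken from the full text.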

Original language: English (US)
Title of host publication: 55th Asilomar Conference on Signals, Systems and Computers, ACSSC 2021
Editors: Michael B. Matthews
Publisher: IEEE Computer Society
Pages: 1419-1425
Number of pages: 7
ISBN (Electronic): 9781665458283
DOIs
State: Published - 2021
Event: 55th Asilomar Conference on Signals, Systems and Computers, ACSSC 2021 - Virtual, Pacific Grove, United States
Duration: Oct 31, 2021 - Nov 3, 2021

Publication series

Name: Conference Record - Asilomar Conference on Signals, Systems and Computers
Volume: 2021-October
ISSN (Print): 1058-6393

Conference

Conference: 55th Asilomar Conference on Signals, Systems and Computers, ACSSC 2021
Country/Territory: United States
City: Virtual, Pacific Grove
Period: 10/31/21 - 11/3/21

ASJC Scopus subject areas

  • Signal Processing
  • Computer Networks and Communications
