TY - GEN
T1 - A Translation Framework for Visually Grounded Spoken Unit Discovery
AU - Wang, Liming
AU - Hasegawa-Johnson, Mark
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Multimodal acoustic unit discovery (MAUD) is a key task in self-supervised spoken language learning and low-resource speech recognition. In this paper, we propose two models for MAUD inspired by machine translation models, where we treat speech and images as the source and target languages. Our word discovery model outperforms the previous state-of-the-art approach by 5.3% alignment F1 on the SpeechCOCO dataset, and our phoneme discovery model outperforms the previous state-of-the-art approach by 7% normalized mutual information on the TIMIT dataset.
AB - Multimodal acoustic unit discovery (MAUD) is a key task in self-supervised spoken language learning and low-resource speech recognition. In this paper, we propose two models for MAUD inspired by machine translation models, where we treat speech and images as the source and target languages. Our word discovery model outperforms the previous state-of-the-art approach by 5.3% alignment F1 on the SpeechCOCO dataset, and our phoneme discovery model outperforms the previous state-of-the-art approach by 7% normalized mutual information on the TIMIT dataset.
UR - http://www.scopus.com/inward/record.url?scp=85127025169&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127025169&partnerID=8YFLogxK
U2 - 10.1109/IEEECONF53345.2021.9723367
DO - 10.1109/IEEECONF53345.2021.9723367
M3 - Conference contribution
AN - SCOPUS:85127025169
T3 - Conference Record - Asilomar Conference on Signals, Systems and Computers
SP - 1419
EP - 1425
BT - 55th Asilomar Conference on Signals, Systems and Computers, ACSSC 2021
A2 - Matthews, Michael B.
PB - IEEE Computer Society
T2 - 55th Asilomar Conference on Signals, Systems and Computers, ACSSC 2021
Y2 - 31 October 2021 through 3 November 2021
ER -